WEBVTT

00:00:00.030 --> 00:00:02.430
The following content is
provided under a Creative

00:00:02.430 --> 00:00:03.850
Commons license.

00:00:03.850 --> 00:00:06.920
Your support will help MIT
OpenCourseWare continue to

00:00:06.920 --> 00:00:10.560
offer high quality educational
resources for free.

00:00:10.560 --> 00:00:13.410
To make a donation or view
additional materials from

00:00:13.410 --> 00:00:17.510
hundreds of MIT courses, visit
MIT OpenCourseWare at

00:00:17.510 --> 00:00:18.760
ocw.mit.edu.

00:00:21.270 --> 00:00:22.890
PROFESSOR RABBAH: OK, so today's
the last lecture day.

00:00:22.890 --> 00:00:25.150
We're going to talk about
the raw architecture.

00:00:25.150 --> 00:00:27.730
This is a processor that was
built here at MIT and

00:00:27.730 --> 00:00:30.650
essentially trailblazed a lot
of the research in terms of

00:00:30.650 --> 00:00:35.190
parallel architectures for
multicores, compilation for

00:00:35.190 --> 00:00:37.160
multicores, programming
language and so on.

00:00:37.160 --> 00:00:39.950
So you've heard some things
about raw and the

00:00:39.950 --> 00:00:42.200
parallelizing technology
in terms of StreamIt.

00:00:42.200 --> 00:00:45.010
So we're going to cover some of
that again here today just

00:00:45.010 --> 00:00:48.460
briefly and give you a little
bit more insight into what

00:00:48.460 --> 00:00:51.240
went into the design of
the raw architecture.

00:00:51.240 --> 00:00:54.150
So these are raw chips;
they were delivered

00:00:54.150 --> 00:00:56.040
in October of 2002.

00:00:56.040 --> 00:00:59.920
Each one of these has
16 processors on it.

00:00:59.920 --> 00:01:04.040
I'm going to show you sort of
a diagram on the next slide.

00:01:04.040 --> 00:01:05.900
It's really a tiled
microprocessor.

00:01:05.900 --> 00:01:09.550
We'll get into what that means
and how it actually--

00:01:09.550 --> 00:01:11.930
what does a tiled microprocessor
give you that

00:01:11.930 --> 00:01:14.210
makes it an attractive
design point in the

00:01:14.210 --> 00:01:16.130
architecture space?

00:01:16.130 --> 00:01:18.640
Each of the raw tiles-- you can
sort of see the outline

00:01:18.640 --> 00:01:21.790
here sort of replicates--

00:01:21.790 --> 00:01:23.060
is four millimeters.

00:01:23.060 --> 00:01:25.800
It's four millimeters square.

00:01:25.800 --> 00:01:28.680
It's a single-issue
8-stage pipeline.

00:01:28.680 --> 00:01:32.820
It has local memory, so
there's a 32K cache.

00:01:32.820 --> 00:01:36.230
And the unique aspect of the raw
processor is that it has a

00:01:36.230 --> 00:01:38.980
lot of on-chip networks that you
could use to orchestrate

00:01:38.980 --> 00:01:41.210
communication between
processors.

00:01:41.210 --> 00:01:43.080
So there's two operand
networks.

00:01:43.080 --> 00:01:44.350
I'm going to get into
what that means and

00:01:44.350 --> 00:01:45.700
what they're used for.

00:01:45.700 --> 00:01:47.480
But these essentially allow
you to do point-to-point

00:01:47.480 --> 00:01:50.260
communication between tiles
with very low latency.

00:01:50.260 --> 00:01:53.280
And then there's a network that
essentially allows you to

00:01:53.280 --> 00:01:56.770
handle cache misses and input
and output and one for message

00:01:56.770 --> 00:02:00.850
passing, a more dynamic style
of messaging, something

00:02:00.850 --> 00:02:03.850
similar to what you're
accustomed to on the cell, for

00:02:03.850 --> 00:02:06.440
example, in DMA transfers.

00:02:06.440 --> 00:02:09.570
This was built in 180
nanometer ASIC

00:02:09.570 --> 00:02:10.510
technology by IBM.

00:02:10.510 --> 00:02:12.810
It's got 100 million
transistors.

00:02:12.810 --> 00:02:16.140
It was designed here by
MIT grad students.

00:02:16.140 --> 00:02:20.240
It's got something like
a million gates on it.

00:02:20.240 --> 00:02:22.270
Three to four years of
development time.

00:02:22.270 --> 00:02:23.940
And what was really interesting
here is that this

00:02:23.940 --> 00:02:27.590
was-- because of the tiled
nature of the architecture,

00:02:27.590 --> 00:02:31.050
you could just design one tile
and then once you have one

00:02:31.050 --> 00:02:32.850
tile, you essentially just plop
down more and more and

00:02:32.850 --> 00:02:33.590
more of them.

00:02:33.590 --> 00:02:36.710
And so you have one, you scale
it out to 16 tiles.

00:02:36.710 --> 00:02:39.680
And the design sort of came back
without any bugs when the

00:02:39.680 --> 00:02:42.010
first chip was delivered.

00:02:42.010 --> 00:02:45.760
The core frequency was expected
to run at 425--

00:02:45.760 --> 00:02:49.870
I think lower than
425 megahertz.

00:02:49.870 --> 00:02:52.430
AUDIENCE: Designed for 250?

00:02:52.430 --> 00:02:54.460
PROFESSOR RABBAH: 250 megahertz
and came back and it

00:02:54.460 --> 00:02:56.350
ran 425 megahertz.

00:02:56.350 --> 00:03:01.690
And it's been clocked as high as
500 megahertz at 2.2 volts.

00:03:01.690 --> 00:03:05.000
The chip isn't really designed
for low power but the tile

00:03:05.000 --> 00:03:08.740
abstraction is really nice for
power consumption because if

00:03:08.740 --> 00:03:10.160
you're not using tiles
you can essentially

00:03:10.160 --> 00:03:11.230
just shut them down.

00:03:11.230 --> 00:03:15.920
So it'll allow you to sort of
have a power efficient design

00:03:15.920 --> 00:03:17.760
just by nature of the
architecture.

00:03:17.760 --> 00:03:20.340
But when you're using all the
tiles, all the memories, all

00:03:20.340 --> 00:03:25.330
the networks, in a non-optimized
design, you

00:03:25.330 --> 00:03:29.260
consume about 18
watts of power.

00:03:29.260 --> 00:03:32.380
So how do you use this
tiled processor?

00:03:32.380 --> 00:03:34.400
So here's one particular
example.

00:03:34.400 --> 00:03:37.000
The nice thing about tile
architecture is that you can

00:03:37.000 --> 00:03:40.470
let applications consume as
many tiles as they need.

00:03:40.470 --> 00:03:42.620
If you have an application with
a lot of parallelism then

00:03:42.620 --> 00:03:44.440
you give it a lot of tiles.

00:03:44.440 --> 00:03:46.110
If you have an application that
doesn't need a lot of

00:03:46.110 --> 00:03:48.150
parallelism then you don't
give it a lot of tiles.

00:03:48.150 --> 00:03:51.540
So it allows you to really
exploit the mapping of your

00:03:51.540 --> 00:03:54.350
application down to the
architecture and gives you

00:03:54.350 --> 00:03:56.990
ASIC-like behavior-- application
specific

00:03:56.990 --> 00:03:59.020
processing technology.

00:03:59.020 --> 00:04:02.170
So one example is you have
some video that you're

00:04:02.170 --> 00:04:05.355
recording and you want to encode
it and stream it across

00:04:05.355 --> 00:04:08.410
the web or display it on your
monitor or whatever else.

00:04:08.410 --> 00:04:10.910
So you can have some logic
that you map down.

00:04:10.910 --> 00:04:13.330
If your chips are here, you
do some computation.

00:04:13.330 --> 00:04:15.940
You have memories sprinkled
across the tile that you're

00:04:15.940 --> 00:04:18.090
going to use for local store.

00:04:18.090 --> 00:04:23.060
So you can parallelize, for
example, the motion estimation

00:04:23.060 --> 00:04:28.940
for encoding the temporal
redundancy in a video stream.

00:04:28.940 --> 00:04:31.200
You can have another application
completely

00:04:31.200 --> 00:04:34.470
independent running on another
part of the chip.

00:04:34.470 --> 00:04:36.250
So here's an application that's
using four different

00:04:36.250 --> 00:04:37.460
tiles and it's really
isolated.

00:04:37.460 --> 00:04:40.450
It doesn't affect what's going
on in these tiles.

00:04:40.450 --> 00:04:42.500
You can have another application
that's running

00:04:42.500 --> 00:04:44.590
something like MPI where
you're doing dynamic

00:04:44.590 --> 00:04:48.500
messaging, an httpd server
and this tile is maybe not

00:04:48.500 --> 00:04:51.580
used so it's just sleeping
or it's idle.

00:04:51.580 --> 00:04:52.930
You can have memories
connected off

00:04:52.930 --> 00:04:55.010
the chip, I/O devices.

00:04:55.010 --> 00:04:58.830
So it's really interesting in
the sense that probably the

00:04:58.830 --> 00:05:01.610
most interesting aspect of it is
you just allow the tiles to

00:05:01.610 --> 00:05:04.180
sort of be used as your
fundamental resource.

00:05:04.180 --> 00:05:04.970
And you can scale
them up as your

00:05:04.970 --> 00:05:07.650
application parallelism scales.

00:05:07.650 --> 00:05:10.580
This is a picture of the raw
board-- the raw motherboard.

00:05:10.580 --> 00:05:13.870
Actually you see it in the Stata
Center in the Raw Lab.

00:05:13.870 --> 00:05:15.530
This is the raw chip.

00:05:15.530 --> 00:05:18.820
A lot of the peripheral
device firmware and

00:05:18.820 --> 00:05:23.630
interconnect for dealing with a
lot of devices off the chip

00:05:23.630 --> 00:05:25.030
are implemented in
these FPGAs, so

00:05:25.030 --> 00:05:27.390
these are Xilinx chips.

00:05:27.390 --> 00:05:28.570
There's DRAM.

00:05:28.570 --> 00:05:33.980
You have connection to a
PCI card, USB stick.

00:05:33.980 --> 00:05:36.560
Network interface so you can
actually log into this machine

00:05:36.560 --> 00:05:37.810
and use it.

00:05:40.600 --> 00:05:42.760
And there's a real compiler.

00:05:42.760 --> 00:05:45.890
You can run real applications
on it.

00:05:45.890 --> 00:05:49.660
There's actually a bigger chip
that we built where we take

00:05:49.660 --> 00:05:52.800
four of these raw chips and
sort of scale them up.

00:05:52.800 --> 00:05:55.350
So rather than having 16 tiles
on your motherboard, you can

00:05:55.350 --> 00:05:56.400
have four raw chips.

00:05:56.400 --> 00:05:57.800
That gives you 64 tiles.

00:05:57.800 --> 00:06:00.160
You can scale this up to a
thousand tiles or so on.

00:06:00.160 --> 00:06:02.120
Just because of the tile
nature, everything is

00:06:02.120 --> 00:06:04.025
symmetric, homogeneous, so
you can really scale

00:06:04.025 --> 00:06:06.500
it up really big.

00:06:06.500 --> 00:06:08.260
So what is the performance
of raw?

00:06:08.260 --> 00:06:12.370
So looking at the overall
application performance, so

00:06:12.370 --> 00:06:13.710
we've done a lot of
benchmarking.

00:06:13.710 --> 00:06:16.120
So these are numbers from a
paper that was published in

00:06:16.120 --> 00:06:19.530
2004, where we took a lot of
applications-- some are

00:06:19.530 --> 00:06:21.790
well-known and used in standard
benchmark suites--

00:06:21.790 --> 00:06:25.800
and compiled them for raw using
various raw compilers

00:06:25.800 --> 00:06:27.750
that we built in-house.

00:06:27.750 --> 00:06:30.170
And we've compared them
against the Pentium 3.

00:06:30.170 --> 00:06:33.170
So the Pentium 3 is sort of
a unique comparison point

00:06:33.170 --> 00:06:36.580
because it sort of matches raw
in terms of the technology

00:06:36.580 --> 00:06:38.890
that was used to fabricate
the two.

00:06:38.890 --> 00:06:41.310
And what you're seeing here,
this is a log scale.

00:06:41.310 --> 00:06:45.460
The speedup of the application
running on raw compared to the

00:06:45.460 --> 00:06:47.110
application running on a P3.

00:06:47.110 --> 00:06:50.660
So the higher you get, the
better the performance is.

00:06:50.660 --> 00:06:55.770
So these applications are sort
of grouped into a few classes.

00:06:55.770 --> 00:06:58.500
So the first class is what
we call ILP applications.

00:06:58.500 --> 00:07:01.620
So these are applications that
have essentially instruction

00:07:01.620 --> 00:07:02.730
level parallelism.

00:07:02.730 --> 00:07:04.680
I'm going to talk a little
bit more about it and

00:07:04.680 --> 00:07:05.450
sort of explain it.

00:07:05.450 --> 00:07:08.040
But you've seen this early
on in the lecture--

00:07:08.040 --> 00:07:10.620
in some of Saman's lectures.

00:07:10.620 --> 00:07:13.020
So here you're trying to exploit
inherent instruction

00:07:13.020 --> 00:07:14.820
level parallelism in
the applications.

00:07:14.820 --> 00:07:17.700
And if you have lots of ILP then
you map it to a lot of

00:07:17.700 --> 00:07:20.040
tiles and you can get
parallelism that way and you

00:07:20.040 --> 00:07:22.110
get better performance.

00:07:22.110 --> 00:07:24.850
These applications here-- what
we call the streaming

00:07:24.850 --> 00:07:25.530
applications.

00:07:25.530 --> 00:07:30.340
So you saw some of these in the
StreamIt lecture and the

00:07:30.340 --> 00:07:32.250
StreamIt parallelizing
compiler.

00:07:32.250 --> 00:07:34.100
Some of those numbers were
generated on a raw-like

00:07:34.100 --> 00:07:35.830
architecture.

00:07:35.830 --> 00:07:39.130
And then you have the server
or sort of more traditional

00:07:39.130 --> 00:07:42.650
applications that you expect
to run in a server style or

00:07:42.650 --> 00:07:45.230
throughput-oriented.

00:07:45.230 --> 00:07:47.260
And then finally you have
bit-level applications.

00:07:47.260 --> 00:07:50.750
So doing things at the very
lowest level of computation

00:07:50.750 --> 00:07:52.900
where you're doing a lot
of bit manipulation.

00:07:52.900 --> 00:07:57.830
So what's interesting here to
note is that as you get into

00:07:57.830 --> 00:08:00.760
more applications that have a
lot of inherent parallelism in

00:08:00.760 --> 00:08:03.830
them, where you want
explicit--

00:08:03.830 --> 00:08:06.270
where you can extract a lot of
parallelism because of the

00:08:06.270 --> 00:08:07.940
explicit nature of the
applications--

00:08:07.940 --> 00:08:09.920
you can really map those
really well onto the

00:08:09.920 --> 00:08:10.630
architecture.

00:08:10.630 --> 00:08:13.850
And because of the communication
nature--

00:08:13.850 --> 00:08:15.650
because of communication
capabilities of the

00:08:15.650 --> 00:08:18.810
architecture, being able to
stream data from one tile to

00:08:18.810 --> 00:08:22.350
another really fast, you can
get really high on-chip

00:08:22.350 --> 00:08:24.130
bandwidth and that gives you
really high performance,

00:08:24.130 --> 00:08:26.380
especially for these kinds
of applications.

00:08:30.120 --> 00:08:32.550
There are other applications
that we've done.

00:08:32.550 --> 00:08:34.800
Some of the students have worked
on them in the raw group.

00:08:34.800 --> 00:08:39.040
So an MPEG-2 encoder where
you're essentially trying to

00:08:39.040 --> 00:08:42.450
do real-time encoding of a
video stream at different

00:08:42.450 --> 00:08:42.930
resolutions.

00:08:42.930 --> 00:08:47.925
So 352 by 240 or 720 by 480
where you're compiling down to

00:08:47.925 --> 00:08:49.640
a number of tiles.

00:08:49.640 --> 00:08:52.360
1, 4, 8, 16--

00:08:52.360 --> 00:08:55.410
1 and 16 are somehow missing,
I'm not sure why.

00:08:55.410 --> 00:08:57.150
And what are you looking
for here?

00:08:57.150 --> 00:08:59.340
Sort of the scalability
of the algorithm.

00:08:59.340 --> 00:09:02.390
As you add more tiles, are
you getting more and more

00:09:02.390 --> 00:09:04.780
performance or are you getting
better and better throughput?

00:09:04.780 --> 00:09:08.610
So you could encode more frames
per second for example.

00:09:08.610 --> 00:09:12.610
So if you're doing HDTV, it's
1080p, then you really want to

00:09:12.610 --> 00:09:13.720
sort of get--

00:09:13.720 --> 00:09:16.800
there's a lot of compute
power that you need.

00:09:16.800 --> 00:09:19.570
And so as you add more tiles,
maybe you can get to sort of

00:09:19.570 --> 00:09:23.750
the throughput that
you need for HDTV.

00:09:23.750 --> 00:09:25.450
So this is something that might
be interesting for some

00:09:25.450 --> 00:09:26.460
of your projects as well.

00:09:26.460 --> 00:09:28.610
And we've talked about
this before.

00:09:28.610 --> 00:09:31.650
On the cell, as you're using
more and more SPEs, can you

00:09:31.650 --> 00:09:34.190
accelerate the performance
of your application?

00:09:34.190 --> 00:09:35.420
Can you sort of show
that if you're

00:09:35.420 --> 00:09:36.640
doing some visual aspect?

00:09:36.640 --> 00:09:38.330
And you can sort of
demonstrate it.

00:09:38.330 --> 00:09:42.060
So there's a demo that is set
up and in the lab where you

00:09:42.060 --> 00:09:44.703
can sort of crank up the number of
tiles that you're using and

00:09:44.703 --> 00:09:47.850
you get better performance
from the MPEG encoder.

00:09:47.850 --> 00:09:50.300
And just looking at the number of
frames per second that you can

00:09:50.300 --> 00:09:55.520
get, with 64 tiles-- so the raw
chip is 16 tiles, but you

00:09:55.520 --> 00:09:58.800
can scale it up by having
more chips--

00:09:58.800 --> 00:10:00.920
so you can get about
51 frames.

00:10:00.920 --> 00:10:03.230
These numbers have been
improved and there are

00:10:03.230 --> 00:10:08.280
different ways of optimizing
these performances.

00:10:08.280 --> 00:10:14.990
At 352 by 240, the estimated
data rate-- estimated

00:10:14.990 --> 00:10:15.210
throughput--

00:10:15.210 --> 00:10:17.585
is almost 160 frames per
second. So this

00:10:17.585 --> 00:10:20.790
is really high bandwidth.

00:10:20.790 --> 00:10:23.790
Another interesting thing that
we've done with the raw chip

00:10:23.790 --> 00:10:27.330
is taking a look at graphics
pipelines and looking at is

00:10:27.330 --> 00:10:30.740
there anything we can do to
exploit the inherent tiled

00:10:30.740 --> 00:10:32.700
architecture of the raw chip.

00:10:32.700 --> 00:10:36.190
So here's a screenshot from
Counterstrike and a simplified

00:10:36.190 --> 00:10:38.930
graphics pipeline where you have
some input to the screen

00:10:38.930 --> 00:10:39.860
you want to render.

00:10:39.860 --> 00:10:41.280
You do some vertex shading.

00:10:41.280 --> 00:10:43.990
So these are triangles that you
want to figure out what

00:10:43.990 --> 00:10:45.810
colors to make--

00:10:45.810 --> 00:10:47.730
what colors to paint them.

00:10:47.730 --> 00:10:50.210
The triangles are set up
for the pixel stage.

00:10:50.210 --> 00:10:53.060
And in this screen you'll notice
that there are two

00:10:53.060 --> 00:10:54.550
different things that
you're rendering.

00:10:54.550 --> 00:10:57.080
There's essentially this part of
the screen which has a lot

00:10:57.080 --> 00:11:00.070
of triangles that span
a relatively

00:11:00.070 --> 00:11:03.470
not-so-complex image.

00:11:03.470 --> 00:11:06.380
And then you have these guys
here that have fewer triangles that

00:11:06.380 --> 00:11:12.520
span a smaller region
of the frame.

00:11:12.520 --> 00:11:14.960
And what you might want to do
is allocate more compute

00:11:14.960 --> 00:11:17.960
power to the pixel stage and
less compute power to the

00:11:17.960 --> 00:11:18.850
vertex stage.

00:11:18.850 --> 00:11:21.660
So that's analogous to saying,
I want more tiles for one

00:11:21.660 --> 00:11:24.260
stage of the pipeline and
fewer tiles for another.

00:11:24.260 --> 00:11:27.040
Or maybe I want to be able to
dynamically change how many

00:11:27.040 --> 00:11:28.430
tiles I'm allocating
at different

00:11:28.430 --> 00:11:30.200
stages of the pipeline.

00:11:30.200 --> 00:11:33.580
So that as your screens that
you're rendering change in

00:11:33.580 --> 00:11:38.120
terms of their complexity, you
can maintain the good visual

00:11:38.120 --> 00:11:43.950
illusions transparently without
compromising the

00:11:43.950 --> 00:11:45.500
utilization of the chip.

00:11:45.500 --> 00:11:47.560
So some demos that were
done with the

00:11:47.560 --> 00:11:49.230
graphics group at MIT--

00:11:49.230 --> 00:11:51.250
Fredo Durand's group--

00:11:51.250 --> 00:11:52.080
phong shading.

00:11:52.080 --> 00:11:55.790
You have 132 vertices
with 1 light source.

00:11:55.790 --> 00:11:57.350
So this is what you're
trying to shade.

00:11:57.350 --> 00:12:00.410
You have a lot of
black regions.

00:12:00.410 --> 00:12:04.110
So if you're looking at a
fixed pipeline where the

00:12:04.110 --> 00:12:06.650
vertex shader is taking
six tiles-- this is

00:12:06.650 --> 00:12:08.120
on a 64-tile chip--

00:12:08.120 --> 00:12:10.920
the rasterizer is taking 15
tiles, the pixel processor has

00:12:10.920 --> 00:12:15.150
15 tiles, the alpha buffer
operations has 15 tiles, then

00:12:15.150 --> 00:12:18.310
you might not get the best
utilization because for that

00:12:18.310 --> 00:12:20.910
entire region that you're
rendering where it's black

00:12:20.910 --> 00:12:23.920
there's nothing really
interesting happening there.

00:12:23.920 --> 00:12:27.310
You want to shift those tiles
to another processor, to

00:12:27.310 --> 00:12:28.770
another stage of pipeline.

00:12:28.770 --> 00:12:31.160
Or, if you can't really utilize
them, then you're just

00:12:31.160 --> 00:12:33.750
wasting power, wasting energy,
and so you might just want to

00:12:33.750 --> 00:12:36.020
shut them down and not
use them at all.

00:12:36.020 --> 00:12:38.120
So with a fixed pipeline
versus a reconfigurable

00:12:38.120 --> 00:12:42.190
pipeline where I can change the
number of tiles allocated

00:12:42.190 --> 00:12:44.660
to different stages of the
pipeline, I can get better

00:12:44.660 --> 00:12:46.430
utilization.

00:12:46.430 --> 00:12:49.250
And, in some cases, better
performance.

00:12:49.250 --> 00:12:51.060
So here, fuller bars,
and you're

00:12:51.060 --> 00:12:53.540
finishing faster in time.

00:12:53.540 --> 00:12:56.540
So this is indicative also
of what's going on in the

00:12:56.540 --> 00:12:57.620
graphics industry.

00:12:57.620 --> 00:12:59.930
So the graphics card
used to be very--

00:13:04.990 --> 00:13:07.930
well, it had fixed resources
allocated to different stages,

00:13:07.930 --> 00:13:10.900
which is essentially what we're
trying to model in this

00:13:10.900 --> 00:13:13.830
part of the experiment, where
more and more now you have

00:13:13.830 --> 00:13:16.300
unified shaders that you can use
for the pixel shading and

00:13:16.300 --> 00:13:16.870
the vertex shading.

00:13:16.870 --> 00:13:19.230
So you're getting into more of
that programmable aspect.

00:13:19.230 --> 00:13:21.660
Precisely because you want to
be able to do this kind of

00:13:21.660 --> 00:13:24.870
load balancing and exploit
the dynamism that you see in

00:13:24.870 --> 00:13:26.530
different things that you're
trying to render.

00:13:29.150 --> 00:13:31.270
Another example:
shadow volumes.

00:13:31.270 --> 00:13:35.170
You have 4 triangles,
one light source.

00:13:35.170 --> 00:13:37.100
And this was rendered
in three passes.

00:13:37.100 --> 00:13:41.285
So pass 1, pass 2, pass 3, would
essentially take the

00:13:41.285 --> 00:13:45.360
same amount of time because
you're doing the same

00:13:45.360 --> 00:13:48.600
computation map to a fixed
number of resources.

00:13:48.600 --> 00:13:50.830
But if I can change the number
of resources that I need for

00:13:50.830 --> 00:13:53.527
different passes-- so the
rasterizer, for example, and

00:13:53.527 --> 00:13:55.690
the alpha buffer operations,
are really where you

00:13:55.690 --> 00:13:56.580
need a lot of power.

00:13:56.580 --> 00:14:01.910
So if you go from 15 tiles for
each to 20 tiles for each, you

00:14:01.910 --> 00:14:04.370
get better execution time
because you were able to

00:14:04.370 --> 00:14:06.820
exploit parallelism or match
parallelism better to the

00:14:06.820 --> 00:14:07.740
application.

00:14:07.740 --> 00:14:09.800
And so you get 40
percent faster in

00:14:09.800 --> 00:14:11.050
this particular case.

00:14:13.460 --> 00:14:16.350
And another interesting
application: this is the

00:14:16.350 --> 00:14:19.200
largest microphone array
in the world.

00:14:19.200 --> 00:14:21.580
It's actually in the Guinness
Book of Records.

00:14:21.580 --> 00:14:23.620
It was built in the lab.

00:14:23.620 --> 00:14:27.140
And what it essentially has--
each of these little boards

00:14:27.140 --> 00:14:28.780
has two microphones on it.

00:14:28.780 --> 00:14:30.850
And so what you can
use this for is

00:14:30.850 --> 00:14:32.050
eavesdropping for example.

00:14:32.050 --> 00:14:35.820
Or you can carry this
around if you want.

00:14:35.820 --> 00:14:38.720
Pack it in the car and
do some spying.

00:14:38.720 --> 00:14:42.720
But somewhat more interesting
demos that were done with this

00:14:42.720 --> 00:14:45.910
in smaller scales was that in a
noisy room, for example, if

00:14:45.910 --> 00:14:47.770
you want to sort of hone in.

00:14:47.770 --> 00:14:51.280
Let's say everybody here was
speaking, but for the camera

00:14:51.280 --> 00:14:52.790
they want to record
only my voice.

00:14:52.790 --> 00:14:56.170
They can have a microphone array
in the back that focuses

00:14:56.170 --> 00:14:57.370
on just my voice.

00:14:57.370 --> 00:15:01.190
And the way it's done is you can
measure the distance from

00:15:01.190 --> 00:15:03.427
the time it takes for the sound
wave to reach each of

00:15:03.427 --> 00:15:05.930
these different microphones
and you can focus in on a

00:15:05.930 --> 00:15:10.950
particular source of noise
and be able to

00:15:10.950 --> 00:15:13.510
just highlight that.

00:15:13.510 --> 00:15:15.380
So there's this demo where
it's a noisy room--

00:15:15.380 --> 00:15:18.470
I probably should have had these
in here in retrospect--

00:15:18.470 --> 00:15:21.750
there's a noisy room, lots of
people are talking, then you

00:15:21.750 --> 00:15:24.180
turn on the microphone array
and you can hear that one

00:15:24.180 --> 00:15:26.000
particular source and
it's a lot clearer.

00:15:26.000 --> 00:15:30.740
You can also have applications
where you're tracking a person

00:15:30.740 --> 00:15:33.060
in a room with videos as
well, so you can sort

00:15:33.060 --> 00:15:34.850
of follow him around.

00:15:34.850 --> 00:15:36.340
So it's a very interesting
application.

00:15:36.340 --> 00:15:39.620
And now I regret not having
the video demo in here.

00:15:39.620 --> 00:15:40.200
Actually, should I do it?

00:15:40.200 --> 00:15:41.550
It's on the Web.

00:15:41.550 --> 00:15:42.890
OK.

00:15:42.890 --> 00:15:45.990
So a case study using
the beamformer.

00:15:45.990 --> 00:15:49.290
So what's being done in the
microphone array is you're

00:15:49.290 --> 00:15:50.130
doing beamforming.

00:15:50.130 --> 00:15:53.270
So you're trying to figure out
what are the different beams

00:15:53.270 --> 00:15:54.290
that are reaching
the microphone.

00:15:54.290 --> 00:15:57.280
You want to be able to
amplify one of them.

00:15:57.280 --> 00:16:02.650
So looking at the application
written natively in C running

00:16:02.650 --> 00:16:06.050
on a 1 gigahertz Pentium 3,
what is the operation

00:16:06.050 --> 00:16:06.620
throughput?

00:16:06.620 --> 00:16:10.470
So you're getting about
240 MegaFLOPS.

00:16:10.470 --> 00:16:14.520
And if you go down
to an optimized--

00:16:14.520 --> 00:16:17.700
same code but running on a single
tile raw chip, you get

00:16:17.700 --> 00:16:19.190
about 19 MegaFLOPS.

00:16:19.190 --> 00:16:20.480
So not very good
performance.

00:16:20.480 --> 00:16:22.530
But here, what you really want
to do is-- you have a lot of

00:16:22.530 --> 00:16:23.200
parallelism.

00:16:23.200 --> 00:16:25.580
Because each of those beams
that's reaching individual

00:16:25.580 --> 00:16:27.660
microphones can be
done in parallel.

00:16:27.660 --> 00:16:29.170
So you have a lot of
parallelism in that

00:16:29.170 --> 00:16:29.830
application.

00:16:29.830 --> 00:16:33.350
So taking the C program,
reimplementing it in StreamIt

00:16:33.350 --> 00:16:36.060
that you've seen in previous
lectures, and not really

00:16:36.060 --> 00:16:38.180
optimizing it in terms
of doing a lot of the

00:16:38.180 --> 00:16:43.360
optimizations you saw in the
parallelizing compiler talk,

00:16:43.360 --> 00:16:44.600
you get about 640 MegaFLOPS.

00:16:44.600 --> 00:16:49.810
So already you're beating the C
program running on a pretty

00:16:49.810 --> 00:16:51.800
fast superscalar machine.

00:16:51.800 --> 00:16:54.240
And if you really optimize the
StreamIt code in terms of

00:16:54.240 --> 00:16:59.030
doing the fission and fusion,
increasing the parallelism,

00:16:59.030 --> 00:17:01.560
doing better load balancing
automatically, you can get up

00:17:01.560 --> 00:17:03.350
to 1.4 GigaFLOPS.

00:17:03.350 --> 00:17:06.420
So really good performance and
really matching the inherent

00:17:06.420 --> 00:17:07.800
parallelism to the
architecture.
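
NOTE
[Editor's note] A quick calculation of the throughput progression just
quoted, as a C sketch; the MegaFLOPS figures are from the lecture, the
ratios are the editor's arithmetic.
  #include <stdio.h>
  int main(void) {
      double c_on_p3 = 240, raw_one_tile = 19;  /* MegaFLOPS, from lecture */
      double streamit = 640, optimized = 1400;  /* MegaFLOPS, from lecture */
      printf("StreamIt vs. C on P3:   %.1fx\n", streamit / c_on_p3);       /* ~2.7x */
      printf("optimized vs. C on P3:  %.1fx\n", optimized / c_on_p3);      /* ~5.8x */
      printf("optimized vs. one tile: %.0fx\n", optimized / raw_one_tile); /* ~74x */
      return 0;
  }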

00:17:10.380 --> 00:17:13.510
So that was just a big overview of
the raw chip and what we've

00:17:13.510 --> 00:17:14.820
done with it in the lab.

00:17:14.820 --> 00:17:17.620
There's more in here than
I've talked about.

00:17:17.620 --> 00:17:20.430
But what I'm going to do next
is give you some insights as

00:17:20.430 --> 00:17:22.700
to what is the design philosophy
that went into raw

00:17:22.700 --> 00:17:27.000
architecture, why was it
designed the way it was.

00:17:27.000 --> 00:17:28.810
And then I'm going to talk a
little bit about the raw

00:17:28.810 --> 00:17:30.310
parallelizing compiler.

00:17:30.310 --> 00:17:33.550
And while the StreamIt language
and compiler also has

00:17:33.550 --> 00:17:36.190
a back end for the raw
architecture, we've sort of

00:17:36.190 --> 00:17:37.300
seen that in previous lectures
so I'm not going to

00:17:37.300 --> 00:17:38.580
talk about that here.

00:17:38.580 --> 00:17:42.520
So I'm just going to focus
on the first two bullets.

00:17:42.520 --> 00:17:47.680
And a few years ago when the
project got started, sort of

00:17:47.680 --> 00:17:52.430
the insight was that wide issue
processors and the design

00:17:52.430 --> 00:17:55.580
philosophy that was being
followed in industry for

00:17:55.580 --> 00:17:58.760
building wider superscalars,
faster superscalars, was

00:17:58.760 --> 00:18:01.560
really going to come to a halt
largely because you have

00:18:01.560 --> 00:18:03.340
scalability issues.

00:18:03.340 --> 00:18:06.840
So if you look at sort of a
simplified illustration of a

00:18:06.840 --> 00:18:10.210
wide issue microprocessor, you
have your program counter that fetches

00:18:10.210 --> 00:18:11.070
instructions.

00:18:11.070 --> 00:18:12.740
Goes into some control logic.

00:18:12.740 --> 00:18:14.200
Control logic is then
going to run.

00:18:14.200 --> 00:18:15.370
You're going to read
some values from

00:18:15.370 --> 00:18:17.210
the register file.

00:18:17.210 --> 00:18:19.700
You'll have a big crossbar in
the middle that routes

00:18:19.700 --> 00:18:20.570
operands to the ALUs.

00:18:20.570 --> 00:18:23.600
And then you operate on
those and you have to send it

00:18:23.600 --> 00:18:25.430
back to the register file.

00:18:25.430 --> 00:18:30.220
Plus you have this really big
problem with the network.

00:18:30.220 --> 00:18:32.850
So if you're doing some
computation--

00:18:32.850 --> 00:18:35.110
sorry, I rearranged
these slides.

00:18:35.110 --> 00:18:38.290
So what you have if you have n
ALUs, then the complexity of

00:18:38.290 --> 00:18:41.660
your crossbar increases as
n squared, because you

00:18:41.660 --> 00:18:42.900
essentially have to
have everybody

00:18:42.900 --> 00:18:44.880
talking to each other.

00:18:44.880 --> 00:18:47.260
And in terms of the number of
wires that you need out of the

00:18:47.260 --> 00:18:49.970
register file to support
everybody being able to sort

00:18:49.970 --> 00:18:52.470
of talk to anybody else very
efficiently, the number of

00:18:52.470 --> 00:18:54.970
ports, the number of wires
increases n cubed.

00:18:54.970 --> 00:18:57.600
So that's a problem because
you can't clock all those

00:18:57.600 --> 00:18:59.150
wires fast enough.

00:18:59.150 --> 00:19:01.410
The frequency becomes
sort of limited.

00:19:01.410 --> 00:19:04.380
It grows even less
than linearly.
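
NOTE
[Editor's note] A minimal sketch of this scaling argument in C. The n-squared
crossbar and n-cubed register file wiring figures are the ones quoted above;
the 4-links-per-tile mesh count is an assumption used for contrast with the
tiled design described later.
  #include <stdio.h>
  int main(void) {
      /* compare centralized wiring with a tiled mesh as the ALU count n grows */
      for (int n = 4; n <= 64; n *= 2)
          printf("n=%2d  crossbar paths (n^2) = %4d  "
                 "register file wires (n^3) = %6d  mesh links (4n) = %3d\n",
                 n, n * n, n * n * n, 4 * n);
      return 0;
  }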

00:19:04.380 --> 00:19:08.230
And this is a problem because
operational routing-- operand

00:19:08.230 --> 00:19:09.620
routing, is global.

00:19:09.620 --> 00:19:12.900
So if I have-- I'm doing some
operations and it's an add,

00:19:12.900 --> 00:19:16.760
the result of this add is fed
to another operation, a shift,

00:19:16.760 --> 00:19:19.660
and these are going to execute
on two different ALUs.

00:19:19.660 --> 00:19:22.860
So what's going to happen?

00:19:22.860 --> 00:19:24.410
I do the add operation.

00:19:24.410 --> 00:19:26.100
It's going to produce
a result.

00:19:26.100 --> 00:19:30.100
But there's no direct path for
this ALU to send this result

00:19:30.100 --> 00:19:30.560
to this ALU.

00:19:30.560 --> 00:19:33.530
So instead what happens is
the operand has to travel

00:19:33.530 --> 00:19:36.195
all the way back around through
the crossbar and then

00:19:36.195 --> 00:19:37.445
back to this ALU.

00:19:39.700 --> 00:19:43.210
So that's really just going to
take a long time and it's not

00:19:43.210 --> 00:19:44.300
necessarily very efficient.

00:19:44.300 --> 00:19:48.140
And if you're doing this for a
lot of ALU operations, you

00:19:48.140 --> 00:19:49.780
have a lot of parallelism in
your application level,

00:19:49.780 --> 00:19:51.750
instruction level
and that's just

00:19:51.750 --> 00:19:53.300
creating a lot of
communication.

00:19:53.300 --> 00:19:55.170
But you're not really exploiting
the locality of the

00:19:55.170 --> 00:19:56.830
computation.

00:19:56.830 --> 00:19:59.440
If 2 instructions are really
close together, you want to be

00:19:59.440 --> 00:20:01.890
able to just have a
point-to-point path, for

00:20:01.890 --> 00:20:05.050
example, or a shorter path that
allows you to exploit

00:20:05.050 --> 00:20:07.920
where the instructions
are in space.

00:20:07.920 --> 00:20:11.220
And so this was the driving
insight for the architecture

00:20:11.220 --> 00:20:14.850
in that you want to make
operand routing local.

00:20:14.850 --> 00:20:18.110
So an idea is to essentially
exploit this locality by

00:20:18.110 --> 00:20:19.730
distributing the ALUs.

00:20:19.730 --> 00:20:22.570
And rather than having that
massive crossbar, what you

00:20:22.570 --> 00:20:25.660
want to do is have an on-chip
mesh network.

00:20:25.660 --> 00:20:28.393
So rather than have one big
crossbar, you have lots of

00:20:28.393 --> 00:20:29.060
smaller ones.

00:20:29.060 --> 00:20:31.040
So these become switch
processors.

00:20:31.040 --> 00:20:34.750
So I can put a value from this
ALU here and then have that

00:20:34.750 --> 00:20:37.350
value routed to any other ALU.

00:20:37.350 --> 00:20:39.990
Maybe that just costs me more in
terms of instructions that

00:20:39.990 --> 00:20:42.770
say where this operand
is going.

00:20:42.770 --> 00:20:44.320
We'll get into that.

00:20:44.320 --> 00:20:46.580
But here, what this allows
me to do is exploit

00:20:46.580 --> 00:20:47.790
that locality better.

00:20:47.790 --> 00:20:51.650
Same instruction chain, I can
put the first operation on one

00:20:51.650 --> 00:20:55.950
ALU, I can put the other
operation on the second ALU.

00:20:55.950 --> 00:20:58.240
And here, rather than putting
it for example here, which

00:20:58.240 --> 00:21:01.230
would send the operand really
far across chip, what I want

00:21:01.230 --> 00:21:03.660
to do is recognize that there's
a producer/consumer

00:21:03.660 --> 00:21:04.800
relationship here.

00:21:04.800 --> 00:21:07.245
I want to exploit that locality
and have them close

00:21:07.245 --> 00:21:11.260
in space so that the routes
remain fairly short.

00:21:11.260 --> 00:21:13.890
You know what I can also do is
sort of pipeline this network

00:21:13.890 --> 00:21:16.770
so that I can have the hardware
essentially match the

00:21:16.770 --> 00:21:18.530
computation flow.

00:21:18.530 --> 00:21:22.900
If one ALU is producing a lot
of results at a lot faster

00:21:22.900 --> 00:21:25.680
rate than for example this
instruction can consume them,

00:21:25.680 --> 00:21:29.470
then the hardware can take care
of, for example, blocking

00:21:29.470 --> 00:21:32.680
or stalling the producing
processor so it doesn't get

00:21:32.680 --> 00:21:33.650
too far ahead.

00:21:33.650 --> 00:21:36.490
It gives you a natural
for regulating the

00:21:36.490 --> 00:21:39.940
flow of data on the chip.

00:21:39.940 --> 00:21:44.680
Well, this is better than what
we saw before because with the

00:21:44.680 --> 00:21:47.260
crossbar you're not really
getting any scalability in

00:21:47.260 --> 00:21:50.670
terms of your latency to transport
operands from one

00:21:50.670 --> 00:21:53.380
ALU to another.

00:21:53.380 --> 00:21:56.790
Whereas with an on-chip network,
if you've taken routing

00:21:56.790 --> 00:21:59.340
classes, you know that there
exists an algorithm that sort

00:21:59.340 --> 00:22:03.030
of allows you to route things in at
least the square root of n,

00:22:03.030 --> 00:22:05.170
where n is the number of things
that are communicating

00:22:05.170 --> 00:22:06.380
in your network.

00:22:06.380 --> 00:22:08.560
But if you're doing locality
driven placement then it's

00:22:08.560 --> 00:22:10.040
essentially constant time.

00:22:10.040 --> 00:22:12.190
And in a raw chip, it's
in fact three cycles.

00:22:12.190 --> 00:22:15.220
So you can send one operand from
one tile to another in

00:22:15.220 --> 00:22:15.780
three cycles.

00:22:15.780 --> 00:22:18.730
And we'll get into how that
number comes about.

00:22:18.730 --> 00:22:19.830
So this is much better.

00:22:19.830 --> 00:22:21.450
But what it does
is increase the

00:22:21.450 --> 00:22:22.750
complexity on the compiler.

00:22:22.750 --> 00:22:25.780
It says, this is my computation,
how do you map it

00:22:25.780 --> 00:22:28.960
efficiently so that things are
clustered in space well so

00:22:28.960 --> 00:22:33.880
that I don't have these really
long routes for communication?

00:22:33.880 --> 00:22:36.190
But then we can look at what
else can we distribute.

00:22:36.190 --> 00:22:38.640
Well, we have the
register file.

00:22:38.640 --> 00:22:41.240
We can distribute that
across all the ALUs.

00:22:41.240 --> 00:22:44.500
And that essentially decreases
that n cubed relationship

00:22:44.500 --> 00:22:47.980
between ALUs and register file
ports to something that's a

00:22:47.980 --> 00:22:49.130
lot more tractable.

00:22:49.130 --> 00:22:54.170
Where it's one small
register file per ALU.

00:22:54.170 --> 00:22:57.370
And this is better in terms of
scalability, but we haven't

00:22:57.370 --> 00:22:59.870
solved the entire problem in
that we still have one global

00:22:59.870 --> 00:23:03.390
program counter, we have one
global instruction fetch unit,

00:23:03.390 --> 00:23:07.240
one global control, a unified
load/store queue for

00:23:07.240 --> 00:23:08.600
communicating with memory.

00:23:08.600 --> 00:23:13.850
And those all have scalability
problems. So whereas we fixed

00:23:13.850 --> 00:23:15.360
the problem with
the crossbar--

00:23:15.360 --> 00:23:17.840
that becomes more scalable--

00:23:17.840 --> 00:23:19.940
we haven't really fixed the
problems with the others.

00:23:19.940 --> 00:23:22.530
So what's the natural
solution to do here?

00:23:22.530 --> 00:23:26.250
Well, we'll just distribute
everything else.

00:23:26.250 --> 00:23:30.090
And so you start off with each
ALU here now having its own

00:23:30.090 --> 00:23:32.610
program counter, its own
instruction cache, its own

00:23:32.610 --> 00:23:33.540
data cache.

00:23:33.540 --> 00:23:37.610
And it has its register file and
ALU-- and that same

00:23:37.610 --> 00:23:40.560
sort of design pattern
is repeated for each

00:23:40.560 --> 00:23:41.840
one of those ALUs.

00:23:41.840 --> 00:23:44.340
So now it looks like it's
a lot more scalable.

00:23:44.340 --> 00:23:46.320
I don't have any global wires.

00:23:46.320 --> 00:23:49.100
There's no global centralized
data structure.

00:23:49.100 --> 00:23:52.220
And all of that means I
can do things more--

00:23:52.220 --> 00:23:55.600
I can do things faster
and more efficiently.

00:23:55.600 --> 00:23:58.990
And what you start seeing here
is this sort of tile processor

00:23:58.990 --> 00:24:00.110
coming about.

00:24:00.110 --> 00:24:03.920
So each one of those things
was exactly the same.

00:24:03.920 --> 00:24:06.470
And what was done in the raw
processor is that none of

00:24:06.470 --> 00:24:09.880
those tiles was larger than the distance
you can communicate in one

00:24:09.880 --> 00:24:10.710
clock cycle.

00:24:10.710 --> 00:24:14.850
So this solved essentially a
wire delay problem as well.

00:24:14.850 --> 00:24:17.600
So if this is the distance
that a wire--

00:24:17.600 --> 00:24:19.340
that a signal can travel
in one clock

00:24:19.340 --> 00:24:21.970
cycle, the tile is smaller.

00:24:21.970 --> 00:24:23.810
It can fit within this circle.

00:24:23.810 --> 00:24:26.820
So that means that you're
guaranteed--

00:24:26.820 --> 00:24:29.200
you have better scalability
properties. You're solving the

00:24:29.200 --> 00:24:32.860
issues that people are facing
with wire delay.

00:24:32.860 --> 00:24:36.940
And in terms of the tile
processor abstraction, Michael

00:24:36.940 --> 00:24:41.680
Taylor, who was a PhD student in
the raw group-- his thesis sort

00:24:41.680 --> 00:24:45.780
of identified the tile processor
approach and this

00:24:45.780 --> 00:24:48.000
aspect of the tile processor
approach that makes it more

00:24:48.000 --> 00:24:50.880
attractive, the SON.

00:24:50.880 --> 00:24:52.990
Which is the scalar
operand network.

00:24:52.990 --> 00:24:57.080
And the next two slides, the
next part of the lecture, is

00:24:57.080 --> 00:25:00.120
going to really focus
on what that means.

00:25:00.120 --> 00:25:02.170
He argues why the tile

00:25:02.170 --> 00:25:05.340
processor approach is scalable.

00:25:05.340 --> 00:25:07.160
And it's scalable for the same
reasons as multicores.

00:25:07.160 --> 00:25:09.350
You just add more and more
cores on a chip.

00:25:09.350 --> 00:25:13.910
But the intrinsic difference
between the multicore that you

00:25:13.910 --> 00:25:16.580
see today and the raw
architecture is the scalar

00:25:16.580 --> 00:25:18.150
operand network.

00:25:18.150 --> 00:25:20.960
So I'm going to ask you
questions about

00:25:20.960 --> 00:25:22.690
this in a few slides.

00:25:22.690 --> 00:25:25.620
But really what you're getting
here is the ability to

00:25:25.620 --> 00:25:28.980
communicate from one processor
to another very efficiently.

00:25:28.980 --> 00:25:31.990
And the way you do this on raw
is you have your instruction

00:25:31.990 --> 00:25:35.340
fetch, decode, register file
read stage, the ALU--

00:25:35.340 --> 00:25:38.090
your computation pipeline.

00:25:38.090 --> 00:25:41.430
But part of the registers-- the
new register file-- so 24

00:25:41.430 --> 00:25:43.960
through 27 are network mapped.

00:25:43.960 --> 00:25:46.890
So what that means is, if
I write-- if one of the

00:25:46.890 --> 00:25:51.800
operations that I have in my
computation has a destination

00:25:51.800 --> 00:25:56.480
register that's 24, 25, 26 or
27, that value automatically

00:25:56.480 --> 00:25:59.360
gets sent to the output
network.

00:25:59.360 --> 00:26:01.150
And if I have a value--

00:26:01.150 --> 00:26:04.960
if one of my source operands is
register 24, 25, 26 or

00:26:04.960 --> 00:26:08.340
27, implicitly that means get
that value off the network.

00:26:12.010 --> 00:26:15.780
And so I can have add 25--

00:26:15.780 --> 00:26:18.560
add to register 25-- so this
is one of the network-mapped

00:26:18.560 --> 00:26:20.760
ports, summing two operands.
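
NOTE
[Editor's note] A hedged sketch of the register-mapped network idea in C,
modeling one operand-network link as a small FIFO. The depth-4 queue and the
function names are illustrative assumptions, not the actual raw ISA; on raw
the send is just an instruction whose destination register is 24 through 27,
and the receive is an instruction whose source register is 24 through 27.
  #include <stdio.h>
  static int fifo[4], head, tail;                          /* one operand-network link */
  static void net_send(int v) { fifo[tail++ % 4] = v; }    /* dest register 25 */
  static int net_recv(void) { return fifo[head++ % 4]; }   /* source register 24 */
  int main(void) {
      net_send(3 + 4);          /* producing tile: the add writes to the network */
      int r = net_recv() << 1;  /* consuming tile: the shift reads off the network */
      printf("%d\n", r);        /* prints 14 */
      return 0;
  }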

00:26:20.760 --> 00:26:23.150
So this is a picture
of the raw chip.

00:26:23.150 --> 00:26:25.100
This is one tile.

00:26:25.100 --> 00:26:26.760
This is the other tile.

00:26:26.760 --> 00:26:30.250
So you can sort of see the
computation and the network

00:26:30.250 --> 00:26:32.110
switch processor here.

00:26:32.110 --> 00:26:36.340
So the operand flows into the
network and then gets

00:26:36.340 --> 00:26:39.360
transported across from
one tile to the other.

00:26:39.360 --> 00:26:40.800
And then gets injected
into the other

00:26:40.800 --> 00:26:43.270
tile's compute network.

00:26:43.270 --> 00:26:46.700
And here this instruction has
sort of a source operand

00:26:46.700 --> 00:26:48.250
that's a register-mapped operand.

00:26:48.250 --> 00:26:49.730
So it knows where to
get its value from.

00:26:49.730 --> 00:26:51.830
And then you can do
the computation.

00:26:51.830 --> 00:26:55.200
An interesting aspect here
is that while you've seen

00:26:55.200 --> 00:26:58.080
instructions like this, just
normal instructions, here you

00:26:58.080 --> 00:27:02.220
also have explicit routing
instructions that are executed

00:27:02.220 --> 00:27:04.330
on the switch processor.

00:27:04.330 --> 00:27:06.960
So the switch processor here
says take the value that's

00:27:06.960 --> 00:27:11.990
coming from my processor and
send it east. So each

00:27:11.990 --> 00:27:15.360
processor can send values east,
west, north or south.

00:27:15.360 --> 00:27:17.950
So it can go to the tile above
it, the tile below it, the

00:27:17.950 --> 00:27:20.650
tile to the left of it or
tile to the right of it.

00:27:20.650 --> 00:27:24.290
And so sending it east sends
it along this wire here.

00:27:24.290 --> 00:27:27.120
And then this particular switch
processor says get a

00:27:27.120 --> 00:27:30.910
value from the west port and
send it to my processor.

00:27:30.910 --> 00:27:33.970
Now you could have had here,
this processor could say, this

00:27:33.970 --> 00:27:37.060
value is not for me, so I want
to just pass it through to some

00:27:37.060 --> 00:27:37.980
other processor.

00:27:37.980 --> 00:27:40.770
So you can pass it from the west
port to the south port or

00:27:40.770 --> 00:27:44.170
to the north port or just pass
it through laterally to the

00:27:44.170 --> 00:27:46.530
other east port.

00:27:46.530 --> 00:27:48.000
So it just allows you to
essentially just have an

00:27:48.000 --> 00:27:50.480
on-chip network and not
operand-- you can imagine

00:27:50.480 --> 00:27:55.040
having an operand that has a
data packet and a header that

00:27:55.040 --> 00:27:58.540
says, I'm going to tile 10
and the switches know

00:27:58.540 --> 00:27:59.510
which way to send it.

00:27:59.510 --> 00:28:01.700
But the interesting aspect
here is that the compiler

00:28:01.700 --> 00:28:04.060
actually orchestrates the
communication, so you don't

00:28:04.060 --> 00:28:06.612
need that extra header that
says, I'm going to tile 10.

00:28:06.612 --> 00:28:09.380
You just have to generate a
schedule of how to route that

00:28:09.380 --> 00:28:11.250
data through.
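
NOTE
[Editor's note] A minimal sketch, in C, of what a compiler-generated static
route schedule means: each switch executes its own routing step, so no
destination header travels with the operand. The port names, tile numbering,
and two-hop path are illustrative assumptions, not the actual raw switch ISA.
  #include <stdio.h>
  enum port { PROC, NORTH, SOUTH, EAST, WEST };
  struct step { int tile; enum port from, to; };
  int main(void) {
      struct step schedule[] = {
          {0, PROC, EAST},  /* switch 0: take value from my processor, send east */
          {1, WEST, EAST},  /* switch 1: not for me, pass it through laterally */
          {2, WEST, PROC},  /* switch 2: deliver to my processor */
      };
      for (int i = 0; i < 3; i++)
          printf("switch %d: route port %d to port %d\n",
                 schedule[i].tile, schedule[i].from, schedule[i].to);
      return 0;
  }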

00:28:11.250 --> 00:28:13.170
So we'll get into what that
means for the compiler in

00:28:13.170 --> 00:28:16.140
terms of that added
complexity.

00:28:16.140 --> 00:28:19.630
So communication on multicores
is expensive for

00:28:19.630 --> 00:28:20.640
the following reasons.

00:28:20.640 --> 00:28:24.400
And this is really sort of going
to contrast or going to put

00:28:24.400 --> 00:28:26.360
the scalar operand network
into slightly more

00:28:26.360 --> 00:28:27.450
perspective.

00:28:27.450 --> 00:28:31.480
But first, so how do you
communicate between multicores

00:28:31.480 --> 00:28:32.650
on the cell?

00:28:32.650 --> 00:28:36.510
You have the DMA transfers
from one SPE to another.

00:28:36.510 --> 00:28:39.570
You can't really ship a
single operand value.

00:28:39.570 --> 00:28:43.030
So if I write the value x, and
I want to send x from one SPE

00:28:43.030 --> 00:28:46.790
to another, I can't really do
that very efficiently, right?

00:28:46.790 --> 00:28:52.140
So this is essentially the
contrasting thing between

00:28:52.140 --> 00:28:55.320
multicore processors that
largely exist today and the

00:28:55.320 --> 00:28:56.350
raw processor.

00:28:56.350 --> 00:29:00.210
So I've shown you an empirical--
a quantitative--

00:29:00.210 --> 00:29:04.170
an analytical model for
communication costs before in

00:29:04.170 --> 00:29:06.380
earlier slides.

00:29:06.380 --> 00:29:08.740
This is an illustration
of that concept.

00:29:08.740 --> 00:29:12.370
So if I have a processor that's
talking to another,

00:29:12.370 --> 00:29:16.230
that value has to travel
across some network and

00:29:16.230 --> 00:29:18.940
there's some transport costs
associated with that.

00:29:18.940 --> 00:29:20.590
But there's also some
added complexities.

00:29:20.590 --> 00:29:22.760
So there were lots of terms,
if you remember, in that

00:29:22.760 --> 00:29:25.730
really big equation
I've shown before.

00:29:25.730 --> 00:29:29.100
You have some overhead in terms
of packaging the data.

00:29:29.100 --> 00:29:32.040
And you have some overhead in
terms of unpacking the data.

00:29:32.040 --> 00:29:33.420
So what does that look like?

00:29:33.420 --> 00:29:36.580
Well, there are two components
we're going to break this down

00:29:36.580 --> 00:29:39.020
to: the send occupancy
and send latency.

00:29:39.020 --> 00:29:40.530
And I'm going to talk
about each of those.

00:29:40.530 --> 00:29:43.050
And similarly on the receive
side, you have the receive

00:29:43.050 --> 00:29:45.640
latency and the receive
occupancy.

00:29:45.640 --> 00:29:50.400
So bear in mind, this lifetime
of a message essentially has

00:29:50.400 --> 00:29:52.820
to flow through these
five components.

00:29:52.820 --> 00:29:55.810
It has to go through the
occupancy stage, then there's

00:29:55.810 --> 00:29:59.810
the send latency, transport,
receive latency and receive

00:29:59.810 --> 00:30:04.830
occupancy before you can
actually use it to compute on.
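
NOTE
[Editor's note] Summarizing the model just described: end-to-end operand cost
= send occupancy + send latency + transport + receive latency + receive
occupancy. This 5-tuple is the yardstick used to compare architectures below.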

00:30:04.830 --> 00:30:06.670
So what are some things
that you do here?

00:30:06.670 --> 00:30:09.900
Well, it's things that you've
done on cell for getting DMA

00:30:09.900 --> 00:30:10.890
transfers to work.

00:30:10.890 --> 00:30:14.040
You have to figure out who the
destination is, what is the

00:30:14.040 --> 00:30:17.800
value, maybe you have an ID
associated with it, a tag,

00:30:17.800 --> 00:30:18.630
things of that sort.

00:30:18.630 --> 00:30:20.120
And you have to essentially
inject that

00:30:20.120 --> 00:30:22.530
message into the network.

00:30:22.530 --> 00:30:24.210
So there's some latency
associated with that.

00:30:24.210 --> 00:30:26.370
Maybe your--

00:30:26.370 --> 00:30:31.480
on cell you have a DMA engine
which essentially hides this

00:30:31.480 --> 00:30:32.510
latency for you.

00:30:32.510 --> 00:30:34.520
Because you can essentially just
send the message to the

00:30:34.520 --> 00:30:36.110
DMA, right into its queue.

00:30:36.110 --> 00:30:39.530
And you can essentially forget
about it unless it stalls

00:30:39.530 --> 00:30:43.340
because the DMA list is full.

00:30:43.340 --> 00:30:45.890
On the receive side, you sort
of have a similar thing.

00:30:45.890 --> 00:30:49.810
You have to get the network to
inject that value into the

00:30:49.810 --> 00:30:53.005
processor and then you have to
depackage it, demultiplex it

00:30:53.005 --> 00:30:55.960
and put it into some form that
you can actually use to

00:30:55.960 --> 00:30:57.670
operate on it.

00:30:57.670 --> 00:31:01.700
So this 5-tuple gives us a
way of sort of characterizing

00:31:01.700 --> 00:31:05.570
communication patterns on
different architectures.

00:31:05.570 --> 00:31:09.530
So I can contrast, for example,
raw versus the

00:31:09.530 --> 00:31:12.520
traditional microprocessor.

00:31:12.520 --> 00:31:15.460
So this is a traditional
superscalar.

00:31:15.460 --> 00:31:18.800
A traditional superscalar
essentially has all the

00:31:18.800 --> 00:31:22.200
sophisticated circuitry that
allows you to essentially

00:31:22.200 --> 00:31:23.660
have a bypass network.

00:31:23.660 --> 00:31:26.020
You can have an operand directly
flowing to another

00:31:26.020 --> 00:31:29.950
ALU through all the n squared
wires in the crossbar.

00:31:29.950 --> 00:31:33.320
And a lot of dynamic scheduling
is going on.

00:31:33.320 --> 00:31:37.110
So it really has no occupancy,
latency, you're not really

00:31:37.110 --> 00:31:39.470
doing any packaging
of the operands.

00:31:39.470 --> 00:31:43.460
Your transport cost is
essentially completely hidden.

00:31:43.460 --> 00:31:46.000
You have no complexity
on the receive side.

00:31:46.000 --> 00:31:47.540
So it's really efficient.

00:31:47.540 --> 00:31:50.140
So this is essentially what you
want to get to: this

00:31:50.140 --> 00:31:51.250
kind of 5-tuple.

00:31:51.250 --> 00:31:54.170
But as we saw before, it's
really not scalable because

00:31:54.170 --> 00:31:57.460
the wire complexity woes--
whether it's n squared or n

00:31:57.460 --> 00:31:59.480
cubed, that's not good
from an energy

00:31:59.480 --> 00:32:01.150
efficient point of view.

00:32:01.150 --> 00:32:02.340
Scalable multiprocessors--

00:32:02.340 --> 00:32:05.580
these are on-chip
multiprocessors more

00:32:05.580 --> 00:32:08.770
indicative of things that you
have today-- have this kind of

00:32:08.770 --> 00:32:12.210
5-tuple where you have about
16 cycles just to get a

00:32:12.210 --> 00:32:15.355
message out, and roughly
3 cycles or so

00:32:15.355 --> 00:32:16.890
to transport a message.

00:32:16.890 --> 00:32:19.370
So maybe this is being done
through a shared cache.

00:32:19.370 --> 00:32:22.120
Which is how a lot of
architectures communicate

00:32:22.120 --> 00:32:23.300
between processors today.

00:32:23.300 --> 00:32:26.970
And you have to sort of
demultiplex the message on the

00:32:26.970 --> 00:32:28.130
receive side.

00:32:28.130 --> 00:32:30.280
So that adds some latency.

00:32:30.280 --> 00:32:34.580
In raw, because you have these
network-mapped registers on

00:32:34.580 --> 00:32:37.210
the input side and the output
side, you really can knock

00:32:37.210 --> 00:32:44.790
down the complexity from the
send side in terms of the

00:32:44.790 --> 00:32:46.770
occupancy and latency to zero.

00:32:46.770 --> 00:32:48.610
And you just write the values
to the register.

00:32:48.610 --> 00:32:50.490
And it looks like a normal
register, right?

00:32:50.490 --> 00:32:53.500
But it just magically appears
on the network.

00:32:53.500 --> 00:32:56.380
And then from one tile to
another, it's one cycle to

00:32:56.380 --> 00:32:59.380
ship the value across that one
link from one switch processor

00:32:59.380 --> 00:33:02.020
to the other, as long as
it's a near neighbor.

00:33:02.020 --> 00:33:04.080
And then two cycles to
inject it from the network

00:33:04.080 --> 00:33:05.820
into the tile processor.

00:33:05.820 --> 00:33:08.270
And then you're ready
to use it.
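
NOTE
[Editor's note] A small C sketch tabulating the 5-tuples quoted in this part
of the lecture. The multiprocessor's receive-side terms are not quantified in
the lecture, so they are left at zero here as a placeholder assumption.
  #include <stdio.h>
  struct tuple5 { const char *name; int so, sl, tr, rl, ro; };
  int main(void) {
      struct tuple5 t[] = {
          {"superscalar (bypass)", 0, 0, 0, 0, 0},
          {"scalable multiproc", 16, 0, 3, 0, 0},  /* receive side unspecified */
          {"raw, near neighbor", 0, 0, 1, 2, 0},   /* 0+0+1+2+0 = 3 cycles */
      };
      for (int i = 0; i < 3; i++)
          printf("%-22s <%d,%d,%d,%d,%d>  total %d cycles\n", t[i].name,
                 t[i].so, t[i].sl, t[i].tr, t[i].rl, t[i].ro,
                 t[i].so + t[i].sl + t[i].tr + t[i].rl + t[i].ro);
      return 0;
  }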

00:33:08.270 --> 00:33:12.790
So in this space, where would
you put cell is the question?

00:33:12.790 --> 00:33:14.310
Anybody have any ideas?

00:33:19.670 --> 00:33:21.790
What would the communication
tuple look like on cell?

00:33:27.960 --> 00:33:30.930
So you have to do explicit
sends and receives.

00:33:30.930 --> 00:33:35.450
So let's look at this.

00:33:35.450 --> 00:33:38.000
So can we get rid of this
stage on cell which is

00:33:38.000 --> 00:33:40.160
essentially saying
packaging up my

00:33:40.160 --> 00:33:42.190
message-- the answer is no, right?

00:33:42.190 --> 00:33:44.500
Because you have to essentially
say where that DMA

00:33:44.500 --> 00:33:46.680
transfer is going to go to--
which region of memory?

00:33:46.680 --> 00:33:49.670
So you're building these
control blocks.

00:33:49.670 --> 00:33:54.230
And then the send latency here
is roughly zero, because you

00:33:54.230 --> 00:33:56.090
have the DMA processor which
allows that kind of

00:33:56.090 --> 00:33:58.830
concurrency between
communication and computation,

00:33:58.830 --> 00:34:03.560
so you can hide essentially that
part of the transport--

00:34:03.560 --> 00:34:05.760
that part of communication
costs.

00:34:05.760 --> 00:34:09.210
Your transport costs here, you
have this really massive

00:34:09.210 --> 00:34:10.860
bandwidth, this really
high bandwidth

00:34:10.860 --> 00:34:11.750
interconnect on the chip.

00:34:11.750 --> 00:34:14.520
So this makes it reasonably
fast, but

00:34:14.520 --> 00:34:16.430
it's still a few cycles.

00:34:16.430 --> 00:34:18.970
There's no near neighbor?

00:34:18.970 --> 00:34:22.420
Yeah, a hundred cycles to do
near neighbor communication.

00:34:22.420 --> 00:34:24.160
Because you're still--

00:34:24.160 --> 00:34:26.210
you don't have that fast
mechanism of being able to

00:34:26.210 --> 00:34:27.910
send things point to point.

00:34:27.910 --> 00:34:32.020
You're putting things on the
bus and there's some

00:34:32.020 --> 00:34:33.690
complexity there.

00:34:33.690 --> 00:34:36.345
On the receive, you have the
same kind of complexity that

00:34:36.345 --> 00:34:37.820
you had on the send side.

00:34:37.820 --> 00:34:39.770
You have to know that the
message is coming, that can be

00:34:39.770 --> 00:34:41.150
done in different ways.

00:34:41.150 --> 00:34:43.790
And then you have to take that
message and write it into your

00:34:43.790 --> 00:34:45.380
local store.
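
As a rough sketch of where that overhead lives on cell, this is roughly what the receive side looks like in SPU code. mfc_get and the tag-status intrinsics are the real Cell SDK calls; the buffer, tag number, and size here are made up for illustration.

    #include <spu_mfcio.h>

    #define TAG 3                      /* arbitrary DMA tag, 0..31 */
    volatile char buf[16384] __attribute__((aligned(128)));

    void receive_block(unsigned long long src_ea, unsigned int size)
    {
        /* Package up the message: describe the transfer to the DMA
           engine -- this is the occupancy you cannot get rid of. */
        mfc_get(buf, src_ea, size, TAG, 0, 0);

        /* The DMA engine overlaps the transport with computation,
           which hides the latency -- until you must block on the tag
           before using the data in the local store. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }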

00:34:45.380 --> 00:34:50.610
Which also adds some overhead in
terms of the communication

00:34:50.610 --> 00:34:57.970
cost. So the cell would probably
be somewhere up here,

00:34:57.970 --> 00:34:58.530
I would imagine.

00:34:58.530 --> 00:35:00.300
I didn't have a chance
to get the numbers.

00:35:00.300 --> 00:35:04.490
If I do, I'll update
the slide later on.

00:35:04.490 --> 00:35:08.770
OK, so that's essentially a
brief insight into the raw--

00:35:08.770 --> 00:35:09.100
yeah?

00:35:09.100 --> 00:35:13.550
AUDIENCE: Where did you get
the scalable processor numbers?

00:35:13.550 --> 00:35:17.500
PROFESSOR RABBAH: So these are
from Michael Taylor's thesis.

00:35:17.500 --> 00:35:21.790
So I believe what he's done here
is just looked at some

00:35:21.790 --> 00:35:24.520
existing microprocessors and
essentially benchmarked

00:35:24.520 --> 00:35:27.050
communication latency from
one processor to another.

00:35:27.050 --> 00:35:30.696
AUDIENCE: So this is like going
through the cache on the

00:35:30.696 --> 00:35:30.830
[OBSCURED]?

00:35:30.830 --> 00:35:32.010
PROFESSOR RABBAH: That's
in fact how you--

00:35:32.010 --> 00:35:34.310
a lot of these multiprocessors
today have shared caches,

00:35:34.310 --> 00:35:37.770
either L1, and more
so now it's L2.

00:35:37.770 --> 00:35:38.300
So if you have--

00:35:38.300 --> 00:35:40.640
L1s are dedicated to different
processors.

00:35:40.640 --> 00:35:41.890
But you still have to go through
the memory to communicate.

00:35:45.750 --> 00:35:48.360
So the raw parallelizing
compiler-- yeah?

00:35:48.360 --> 00:35:50.310
Another question?

00:35:50.310 --> 00:35:52.540
AUDIENCE: You might want to
postpone this question.

00:35:52.540 --> 00:35:57.950
Two related questions:
so raw has--

00:35:57.950 --> 00:36:00.500
I guess raw has pretty well
optimized nearest neighbor

00:36:00.500 --> 00:36:02.170
communication.

00:36:02.170 --> 00:36:08.074
But we know from, for example,
Rent's Rule, a heuristic in

00:36:08.074 --> 00:36:11.910
electrical engineering, about
the number of wires needed for

00:36:11.910 --> 00:36:12.652
a given area.

00:36:12.652 --> 00:36:14.190
Is that in between--

00:36:14.190 --> 00:36:21.050
as I recall, the minimum
for a good sized circuit is

00:36:21.050 --> 00:36:24.592
proportional to the perimeter,
or roughly the

00:36:24.592 --> 00:36:27.480
square root of the area.

00:36:27.480 --> 00:36:33.070
And it ranges from there to--
not proportional to the area.

00:36:33.070 --> 00:36:34.955
There's something in between.

00:36:34.955 --> 00:36:36.790
Something with 3 in it.

00:36:36.790 --> 00:36:40.180
Like to the 3/2 power
I think, perhaps.

00:36:40.180 --> 00:36:41.710
No, something like 2/3rds,
something like--

00:36:41.710 --> 00:36:42.910
yeah, 2/3rds power.

00:36:42.910 --> 00:36:47.070
So the area to the 1/2 power or
area to the 2/3rds power.

00:36:47.070 --> 00:36:51.380
So Rent's Rule says the number
of wires you need is roughly

00:36:51.380 --> 00:36:52.770
in that area.

00:36:52.770 --> 00:36:55.860
And so that sort of
pushes that--

00:36:55.860 --> 00:36:58.650
so the minimum you need is
nearest-neighbor communication.

00:36:58.650 --> 00:37:01.990
And often you need
more than that.

00:37:01.990 --> 00:37:06.470
We know from the FPGA experience
that nearest

00:37:06.470 --> 00:37:09.470
neighbor communication
is not--

00:37:09.470 --> 00:37:11.010
or, at least, it's good to
have more than nearest

00:37:11.010 --> 00:37:13.930
neighbor, and that often long
wires are routed across the

00:37:13.930 --> 00:37:15.610
chip, in extremely high--

00:37:15.610 --> 00:37:16.600
PROFESSOR RABBAH: So I'm going
to actually show you an

00:37:16.600 --> 00:37:20.280
example where nearest neighbor
is good but you might also

00:37:20.280 --> 00:37:23.130
want some global mechanism
for control

00:37:23.130 --> 00:37:25.490
orchestration for example.

00:37:25.490 --> 00:37:28.470
AUDIENCE: Not just for con--
not surely just for control

00:37:28.470 --> 00:37:31.810
but for broadcast, for arbitrary
communication for the computation

00:37:31.810 --> 00:37:35.030
to use, not just for
the chip to use.

00:37:35.030 --> 00:37:38.970
Like why are you scaling out two
hops, four hops, fewer and

00:37:38.970 --> 00:37:39.450
fewer wires--

00:37:39.450 --> 00:37:42.110
PROFESSOR RABBAH: Yes, in fact
what I think is going to

00:37:42.110 --> 00:37:44.280
happen is a lot of these chip
designs are going to be

00:37:44.280 --> 00:37:45.090
hierarchical.

00:37:45.090 --> 00:37:49.570
You have some really global
type communication at the

00:37:49.570 --> 00:37:50.300
highest level.

00:37:50.300 --> 00:37:53.140
And then as you get within each
one of the processors,

00:37:53.140 --> 00:37:55.610
then you see things at the
lowest level, something that

00:37:55.610 --> 00:37:56.070
looks like raw.

00:37:56.070 --> 00:37:58.690
So you can build sort of a
hierarchy of communication

00:37:58.690 --> 00:38:02.590
stages that allow you to sort
of solve that problem.

00:38:02.590 --> 00:38:04.110
But all of that adds
complexity, right?

00:38:04.110 --> 00:38:05.540
First you have to solve the
problem of how do you

00:38:05.540 --> 00:38:09.120
parallelize for just a fixed
number of cores and then

00:38:09.120 --> 00:38:10.470
figure out the communications.

00:38:10.470 --> 00:38:13.050
Once we understand how to
do that well with a nice

00:38:13.050 --> 00:38:15.745
programming model then you can
build hierarchically on that.

00:38:15.745 --> 00:38:17.975
AUDIENCE: On the other hand, it
might make the compiler's

00:38:17.975 --> 00:38:20.250
job easier because it's
not as constrained.

00:38:20.250 --> 00:38:21.090
PROFESSOR RABBAH: It
might give you a

00:38:21.090 --> 00:38:21.570
nice fallback, right?

00:38:21.570 --> 00:38:24.915
It might save you in cases where
there are things that

00:38:24.915 --> 00:38:26.360
are hard to do.

00:38:26.360 --> 00:38:29.720
There are some issues

00:38:29.720 --> 00:38:33.120
in the last two or
three slides.

00:38:33.120 --> 00:38:36.862
We'll talk about an example of
where that might be the case.

00:38:36.862 --> 00:38:40.970
AUDIENCE: Another question
which [OBSCURED]

00:38:40.970 --> 00:38:45.770
so raw, I guess, being simple
and tiled, one of the

00:38:45.770 --> 00:38:47.436
selling points I think was that
it really cuts down on

00:38:47.436 --> 00:38:48.850
the engineering effort.

00:38:48.850 --> 00:38:49.580
PROFESSOR RABBAH:
Oh, absolutely.

00:38:49.580 --> 00:38:54.660
This was done with a million
gates in-house for [OBSCURED]

00:38:54.660 --> 00:38:58.040
AUDIENCE: So a company like
Intel has a ridiculous number

00:38:58.040 --> 00:38:58.860
of engineers.

00:38:58.860 --> 00:39:01.485
And to get a competitive edge,
they want to

00:39:01.485 --> 00:39:02.431
apply more engineering to it.

00:39:02.431 --> 00:39:05.902
And so the question is, where
might you apply more

00:39:05.902 --> 00:39:07.760
engineering to try
to squeeze more--

00:39:07.760 --> 00:39:09.416
PROFESSOR AMARASINGHE: That's
the million dollar question

00:39:09.416 --> 00:39:11.220
that everybody's looking at.

00:39:11.220 --> 00:39:14.188
Because somehow Intel thought
they could add more

00:39:14.188 --> 00:39:15.570
and more engineering

00:39:15.570 --> 00:39:19.520
and then build this very complex
full-scale [OBSCURED]

00:39:19.520 --> 00:39:22.200
But separate vessels.

00:39:22.200 --> 00:39:26.500
And so I think there's still a
lot of things that are wrong.

00:39:26.500 --> 00:39:33.090
Meaning it's [OBSCURED]

00:39:33.090 --> 00:39:35.100
so at Intel basically
they will let you do

00:39:35.100 --> 00:39:36.450
something like that.

00:39:36.450 --> 00:39:39.650
They will put a lot of engineers
doing each of these

00:39:39.650 --> 00:39:43.640
components, fine-tuning them,
and they can get a lot more

00:39:43.640 --> 00:39:47.140
performance, a lot less power
and stuff like that.

00:39:47.140 --> 00:39:53.150
So depending on what you want,
science is not everything.

00:39:53.150 --> 00:39:58.420
There are a lot of other
things [OBSCURED]

00:39:58.420 --> 00:40:00.820
So while it makes it easier?

00:40:00.820 --> 00:40:08.260
[OBSCURED]

00:40:08.260 --> 00:40:11.435
And the key thing is, you start
something simple and as

00:40:11.435 --> 00:40:14.220
you go on, you can add more
and more complexity.

00:40:14.220 --> 00:40:18.510
Just as there are more
things to do.

00:40:18.510 --> 00:40:20.680
PROFESSOR RABBAH: Part of the
complexity might be going to--

00:40:20.680 --> 00:40:25.860
not making all those
[OBSCURED].

00:40:25.860 --> 00:40:30.030
OK, so raw pushes a lot of the
complexity into the compiler

00:40:30.030 --> 00:40:33.240
in that the compiler now has
to do at least two things.

00:40:33.240 --> 00:40:35.250
It has to distribute
the instructions.

00:40:35.250 --> 00:40:37.450
You have a single program and
you have to figure out how to

00:40:37.450 --> 00:40:39.140
parallelize it across
multiple cores.

00:40:39.140 --> 00:40:41.900
But not only that, because you
have the scalar operand

00:40:41.900 --> 00:40:44.480
network, you have to figure out
how the different cores

00:40:44.480 --> 00:40:45.410
have to talk to each other.

00:40:45.410 --> 00:40:47.790
So you have to essentially
generate a schedule for the

00:40:47.790 --> 00:40:50.400
switch processors as well.

00:40:50.400 --> 00:40:52.055
So I'm going to talk a
little bit about the

00:40:52.055 --> 00:40:53.480
raw parallelizing compiler.

00:40:53.480 --> 00:40:55.470
And this is different from
a StreamIt parallelizing

00:40:55.470 --> 00:40:58.890
compiler, which really takes
a different program as

00:40:58.890 --> 00:41:01.450
an input, using a different
language.

00:41:01.450 --> 00:41:04.830
This is work again done here
at MIT by Walter Lee who

00:41:04.830 --> 00:41:07.570
graduated two years ago.

00:41:07.570 --> 00:41:09.050
We have a sequential program.

00:41:09.050 --> 00:41:14.030
You inject it into RawCC, the
raw C compiler, and you get

00:41:14.030 --> 00:41:17.070
fine-grained orchestrated
parallel execution.

00:41:17.070 --> 00:41:20.700
And what the compiler does is
worry about data distribution

00:41:20.700 --> 00:41:23.290
just like you have to do on cell
in terms of which memory

00:41:23.290 --> 00:41:25.270
goes into which local store.

00:41:25.270 --> 00:41:27.560
which computation
operates on--

00:41:27.560 --> 00:41:29.540
the raw compiler has to worry
about which computation

00:41:29.540 --> 00:41:32.460
operates on which data element
and can you put that data in

00:41:32.460 --> 00:41:36.370
the right caches for each
of the different tiles.

00:41:36.370 --> 00:41:39.400
Instruction distribution:
so the way this compiler

00:41:39.400 --> 00:41:41.060
essentially gets parallelism:
it's going to look at

00:41:41.060 --> 00:41:43.270
instruction level parallelism
in your application.

00:41:43.270 --> 00:41:45.780
And it's going to divide that up
among the different cores.

00:41:45.780 --> 00:41:48.810
And then the last step is the
coordination of communication

00:41:48.810 --> 00:41:50.000
and control flow.

00:41:50.000 --> 00:41:51.330
So I'm just going
to briefly step

00:41:51.330 --> 00:41:53.570
through each one of those.

00:41:53.570 --> 00:41:56.890
So the data distribution is
really essentially trying

00:41:56.890 --> 00:41:58.410
to solve the problem
of locality.

00:41:58.410 --> 00:42:01.350
You have two instructions.

00:42:01.350 --> 00:42:04.030
A load into r1 from some
address and then

00:42:04.030 --> 00:42:05.410
you're adding r1.

00:42:05.410 --> 00:42:06.930
You're incrementing
that value.

00:42:06.930 --> 00:42:08.970
And you might write it
back for later on.

00:42:08.970 --> 00:42:11.110
So where would you put these
two instructions?

00:42:11.110 --> 00:42:15.060
So to exploit the locality, then
you want the data-- if

00:42:15.060 --> 00:42:18.020
the data is here, then you want
these two instructions to

00:42:18.020 --> 00:42:19.310
be on this tile.

00:42:19.310 --> 00:42:21.755
If the data is here, then you
want these two instructions to

00:42:21.755 --> 00:42:23.420
be on this tile.

00:42:23.420 --> 00:42:25.700
Because it doesn't help you to
have the data here and the

00:42:25.700 --> 00:42:27.130
instructions here.

00:42:27.130 --> 00:42:29.120
Because what do you have
to do in that case?

00:42:29.120 --> 00:42:31.390
You have to send a message that
says, send me this data.

00:42:31.390 --> 00:42:34.050
And then you have to wait for
it to come in and then you

00:42:34.050 --> 00:42:35.020
have to operate on it.

00:42:35.020 --> 00:42:37.300
And then maybe you have
to write it back.

00:42:37.300 --> 00:42:39.220
So the compiler sort of
worries about the data

00:42:39.220 --> 00:42:40.030
distribution.

00:42:40.030 --> 00:42:42.190
It applies some data analysis.

00:42:42.190 --> 00:42:45.530
A lot of the things you saw in
Saman's lecture on classic

00:42:45.530 --> 00:42:47.020
parallelization technology.

00:42:47.020 --> 00:42:49.280
Sort of figure out the
interdependencies and then

00:42:49.280 --> 00:42:51.770
they can figure out how to split
up the data across the

00:42:51.770 --> 00:42:52.840
different cores.
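
In code, this is the familiar owner-computes idea. A minimal sketch, assuming a hypothetical my_tile() runtime query and an array split evenly across the 16 tiles; the point is just that the load, the add, and the store all hit the local cache.

    #define NTILES 16

    extern int my_tile(void);   /* hypothetical: which tile am I? */

    /* Each tile increments only the slice of the array whose home
       is its own local cache, so no operands cross the network. */
    void increment_owned(int *a, int n)
    {
        int chunk = n / NTILES;
        int lo = my_tile() * chunk;
        for (int i = lo; i < lo + chunk; i++)
            a[i] += 1;          /* load r1; add r1, 1; store r1 */
    }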

00:42:52.840 --> 00:42:55.683
And there's some other work done
by other students in the

00:42:55.683 --> 00:42:58.470
group that tried to address
this problem.

00:42:58.470 --> 00:43:05.020
The instruction distribution is
perhaps just as complicated and

00:43:05.020 --> 00:43:06.040
interesting.

00:43:06.040 --> 00:43:07.980
In here, what's going
on is-- let's say

00:43:07.980 --> 00:43:09.250
you have a basic block.

00:43:09.250 --> 00:43:10.950
So you take your sequential
program.

00:43:10.950 --> 00:43:14.010
You figure out what are the
different basic blocks of

00:43:14.010 --> 00:43:17.200
computation that you have and
within the basic block you

00:43:17.200 --> 00:43:18.510
have lots of instructions.

00:43:18.510 --> 00:43:21.650
So each one of these green
boxes is a particular

00:43:21.650 --> 00:43:22.560
instruction.

00:43:22.560 --> 00:43:25.230
And what you're seeing-- these
arrows here that connect the

00:43:25.230 --> 00:43:28.090
edges-- are operands that
you have to exchange.

00:43:28.090 --> 00:43:30.320
So you might have--

00:43:33.190 --> 00:43:33.835
this is an add instruction.

00:43:33.835 --> 00:43:35.880
It requires a value
coming from here.

00:43:35.880 --> 00:43:36.970
Multiply--

00:43:36.970 --> 00:43:39.640
subtract instruction requires
values coming in from

00:43:39.640 --> 00:43:40.690
different areas.

00:43:40.690 --> 00:43:42.490
So how would you
distribute this

00:43:42.490 --> 00:43:44.630
across a number of cores--

00:43:44.630 --> 00:43:46.720
or across a number of tiles?

00:43:46.720 --> 00:43:50.150
Any ideas here?

00:43:50.150 --> 00:43:53.350
So you can look for, for
example, some chains that are

00:43:53.350 --> 00:43:55.330
not interconnected.

00:43:55.330 --> 00:43:57.540
So you can look for clusters
that you can use.

00:43:57.540 --> 00:44:00.940
And say, OK, well I see no edges
here so maybe I can put

00:44:00.940 --> 00:44:02.870
this on one tile.

00:44:02.870 --> 00:44:05.270
And then maybe I can put some
of these instructions on

00:44:05.270 --> 00:44:06.440
another tile.

00:44:06.440 --> 00:44:09.010
Because sort of the
communication flow is local.

00:44:09.010 --> 00:44:12.630
So maybe one strategy might
be, look for the longest

00:44:12.630 --> 00:44:15.000
single chains so you can keep
the communication local.

00:44:15.000 --> 00:44:18.630
And then you apply an
algorithm and come up with a

00:44:18.630 --> 00:44:20.550
number of clusters.

00:44:20.550 --> 00:44:22.530
Something like that
does happen.
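
A toy version of that step might look like the sketch below. It assumes the instructions arrive in topological order and that each one records the single predecessor feeding its critical operand; instructions that extend a chain inherit its cluster, and chain heads open a new one. This is only the flavor of the heuristic, not the compiler's actual algorithm.

    struct insn {
        int pred;       /* critical predecessor index, or -1 for none */
        int cluster;    /* output: which tile-thread this insn joins  */
    };

    /* Greedy chain clustering over a basic block in topological
       order: keep each dependence chain on one tile so its operand
       traffic stays local. Returns the number of clusters formed. */
    int cluster_chains(struct insn *code, int n)
    {
        int next = 0;
        for (int i = 0; i < n; i++)
            code[i].cluster = (code[i].pred < 0)
                            ? next++                       /* new chain */
                            : code[code[i].pred].cluster;  /* extend it */
        return next;
    }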

00:44:22.530 --> 00:44:26.070
And keep in mind from the
lectures we talked about the

00:44:26.070 --> 00:44:27.800
parallelizing compiler, you
have to worry about

00:44:27.800 --> 00:44:29.550
parallelism versus
communication.

00:44:29.550 --> 00:44:31.800
So the more you distribute
things, the more communication

00:44:31.800 --> 00:44:33.240
you have to get right.

00:44:33.240 --> 00:44:34.640
So here we're showing--

00:44:34.640 --> 00:44:38.400
what I'm showing is color
mapping from the original

00:44:38.400 --> 00:44:41.520
instructions in the basic block
to the same instructions, but

00:44:41.520 --> 00:44:44.290
now each color essentially
represents a different cluster

00:44:44.290 --> 00:44:48.900
or essentially code that would
map to a different thread.

00:44:48.900 --> 00:44:52.270
So blue is one thread, yellow is
another, green is another,

00:44:52.270 --> 00:44:54.260
red, purple, and so on.

00:44:54.260 --> 00:44:56.680
But I have to worry about
communication between the

00:44:56.680 --> 00:44:58.770
different colors because
they're essentially two

00:44:58.770 --> 00:44:59.960
different threads.

00:44:59.960 --> 00:45:02.320
They're going to run on two
different processors or two

00:45:02.320 --> 00:45:03.400
different tiles.

00:45:03.400 --> 00:45:08.800
So those arrows that are
highlighted in dark black are

00:45:08.800 --> 00:45:09.320
communication edges.

00:45:09.320 --> 00:45:11.860
They have to explicitly send
the operands around.

00:45:11.860 --> 00:45:14.310
Right?

00:45:14.310 --> 00:45:16.470
So then I might look
at the granularity.

00:45:16.470 --> 00:45:18.260
What is my communication cost?

00:45:18.260 --> 00:45:19.770
What is my computation cost?

00:45:19.770 --> 00:45:21.350
And I want to worry about
load balancing.

00:45:21.350 --> 00:45:26.870
As we saw, load balancing can
help you better

00:45:26.870 --> 00:45:28.490
make use of your architecture
and give you better

00:45:28.490 --> 00:45:30.770
utilization, better
throughput.

00:45:30.770 --> 00:45:33.250
So you might essentially say,
it doesn't-- it's not

00:45:33.250 --> 00:45:36.650
worthwhile to have these running
on a different tile

00:45:36.650 --> 00:45:38.660
because there's a lot of
communication going on.

00:45:38.660 --> 00:45:40.290
So maybe I'd want to fuse
those together.

00:45:40.290 --> 00:45:43.870
Keep the communication local.

00:45:43.870 --> 00:45:46.940
And essentially eliminate
costly communication.

00:45:46.940 --> 00:45:48.680
So there are different
heuristics that you can apply.

00:45:48.680 --> 00:45:51.630
You can use that 5-tuple.

00:45:51.630 --> 00:45:54.310
You can use heuristic space on
the 5-tuple to determine when

00:45:54.310 --> 00:45:58.510
it's profitable to break things
up and when it's not.
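
For instance, a profitability test built on that 5-tuple might look like the sketch below. The field names follow the occupancy/latency breakdown from earlier in the lecture; the cost model itself is invented here for illustration.

    /* The 5-tuple of communication costs, in cycles. */
    struct net_cost {
        int send_occ, send_lat, hop_lat, recv_lat, recv_occ;
    };

    /* Split two clusters onto separate tiles only when the cycles
       gained from running them in parallel beat the cost of shipping
       their shared operands across the network. */
    int worth_splitting(int cycles_saved, int operands, int hops,
                        struct net_cost c)
    {
        int per_op = c.send_occ + c.send_lat + hops * c.hop_lat
                   + c.recv_lat + c.recv_occ;
        return cycles_saved > operands * per_op;
    }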

00:45:58.510 --> 00:46:01.050
And then you have to worry
about placement.

00:46:01.050 --> 00:46:04.010
So you don't quite have this
on cell in that you create

00:46:04.010 --> 00:46:06.230
these SPE threads and
they can run on any

00:46:06.230 --> 00:46:08.020
SPE. In the raw compiler,

00:46:08.020 --> 00:46:10.410
you can actually exploit the
spatial characteristics of the

00:46:10.410 --> 00:46:14.010
chip in the point-to-point
communication network to say,

00:46:14.010 --> 00:46:16.950
I want to put these two threads
on tile 1 and tile 2,

00:46:16.950 --> 00:46:19.300
where tile 1 and tile 2 are
adjacent to each other.

00:46:19.300 --> 00:46:21.770
Because I have a well-defined
communication pattern that I'm

00:46:21.770 --> 00:46:22.640
going to use.

00:46:22.640 --> 00:46:26.350
And map to the communication
network on the chip to get

00:46:26.350 --> 00:46:29.710
really fast, really
low latency.

00:46:29.710 --> 00:46:32.210
So you can take each one of
these colors, place it on a

00:46:32.210 --> 00:46:33.360
different tile.

00:46:33.360 --> 00:46:36.490
And now you have these wires
that are going across these

00:46:36.490 --> 00:46:39.040
tiles which essentially
represent communication.

00:46:39.040 --> 00:46:41.570
But now the tile has to worry
about, oh, I have to

00:46:41.570 --> 00:46:43.960
essentially send these
on fixed routes.

00:46:43.960 --> 00:46:46.450
There's no arbitrary
communication mechanism.

00:46:46.450 --> 00:46:50.750
So if there's data going from
this tile to this tile, it

00:46:50.750 --> 00:46:52.950
actually has to be routed
through a network.

00:46:52.950 --> 00:46:54.830
And that might mean getting
routed through somebody

00:46:54.830 --> 00:46:57.630
else's tile.

00:46:57.630 --> 00:47:00.950
So the next stage would be
communication coordination.

00:47:00.950 --> 00:47:05.510
You have to figure out which
switch you need to go to and

00:47:05.510 --> 00:47:08.210
what do you do to get that
operand to the right switch

00:47:08.210 --> 00:47:10.100
which then gets it to
the right processor.

00:47:10.100 --> 00:47:12.960
So here, I believe the heuristic
is to do dimension

00:47:12.960 --> 00:47:17.700
order routing so you send along
the x-dimension and then

00:47:17.700 --> 00:47:18.860
the y-dimension.

00:47:18.860 --> 00:47:19.650
I might have those reversed.

00:47:19.650 --> 00:47:23.210
I don't know.
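
Whichever dimension goes first, the routing itself is simple to state. A small self-contained sketch:

    #include <stdio.h>

    /* Dimension-ordered (XY) routing: walk the x-dimension to
       completion, then the y-dimension. Writes the hop directions
       into out as E/W/S/N characters. */
    void xy_route(int sx, int sy, int dx, int dy, char *out)
    {
        while (sx != dx) { *out++ = dx > sx ? 'E' : 'W'; sx += dx > sx ? 1 : -1; }
        while (sy != dy) { *out++ = dy > sy ? 'S' : 'N'; sy += dy > sy ? 1 : -1; }
        *out = '\0';
    }

    int main(void)
    {
        char route[32];
        xy_route(0, 0, 2, 3, route);  /* tile (0,0) to tile (2,3) */
        printf("%s\n", route);        /* prints EESSS */
        return 0;
    }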

00:47:23.210 --> 00:47:25.610
And then finally, now you've
figured out your communication

00:47:25.610 --> 00:47:28.190
patterns, you've figured out
your instructions, you do some

00:47:28.190 --> 00:47:29.440
instruction scheduling.

00:47:29.440 --> 00:47:31.360
And what you can do here,
because the communication

00:47:31.360 --> 00:47:33.965
patterns are static, you've
split up the instructions so

00:47:33.965 --> 00:47:38.110
you know when you need to ship
data around and how.

00:47:38.110 --> 00:47:41.010
You can guarantee deadlock
freedom by carefully ordering

00:47:41.010 --> 00:47:42.690
your send and receive pairs.

00:47:42.690 --> 00:47:46.370
So what you see here, every time
you see an instruction

00:47:46.370 --> 00:47:48.800
that needs to ship an operand
around, there's the equivalent

00:47:48.800 --> 00:47:51.590
of a route instruction
that has route east,

00:47:51.590 --> 00:47:53.330
west, north, south.

00:47:53.330 --> 00:47:56.940
There's an equivalent route
instruction on the other

00:47:56.940 --> 00:47:57.800
processors.

00:47:57.800 --> 00:48:00.590
And that allows you to
essentially analyze code and

00:48:00.590 --> 00:48:04.020
say, OK, I've laid these things
out carefully, I've

00:48:04.020 --> 00:48:06.330
orchestrated my send and
receive pairs so I can

00:48:06.330 --> 00:48:08.800
guarantee, for example, there
are no overlapping routes.

00:48:08.800 --> 00:48:12.540
Or that there are no deadlocks
because one is trying to ship to

00:48:12.540 --> 00:48:14.540
the other while the other is
also trying to ship, and they

00:48:14.540 --> 00:48:19.000
both block on the shared
network link.
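
The ordering discipline is easy to see with two tiles and hypothetical blocking send/receive primitives standing in for the compiled route instructions: if both tiles send first into a link with limited buffering, both can stall, so the schedule emits the pairs in a matched order.

    extern void send_to(int tile, int v);   /* hypothetical, blocking */
    extern int  recv_from(int tile);        /* hypothetical, blocking */

    /* Tile 0 is scheduled to send first... */
    int tile0_exchange(int v)
    {
        send_to(1, v);
        return recv_from(1);
    }

    /* ...and tile 1 to receive first, so the two blocking
       operations can never form a cycle on the shared link. */
    int tile1_exchange(int v)
    {
        int x = recv_from(0);
        send_to(0, v);
        return x;
    }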

00:48:19.000 --> 00:48:20.740
And finally, you have the
code representation.

00:48:20.740 --> 00:48:24.050
So this is where you package
things up into object files,

00:48:24.050 --> 00:48:26.420
into essentially things
like threads.

00:48:26.420 --> 00:48:28.940
And then you can compile
them and run them.

00:48:28.940 --> 00:48:32.580
Now the question that was posed
earlier is, well there's

00:48:32.580 --> 00:48:35.290
one thing we haven't talked
about and that's branching.

00:48:35.290 --> 00:48:38.700
This is a sequential program,
it executes branches.

00:48:38.700 --> 00:48:41.605
And now I have this loop that
I've split up across a number

00:48:41.605 --> 00:48:44.990
of tiles, how do I know who's
going to do the branch?

00:48:44.990 --> 00:48:47.360
And if one tile is doing
the branch, how does it

00:48:47.360 --> 00:48:49.190
communicate with
everybody else?

00:48:49.190 --> 00:48:51.735
Or if I'm going to repeat the
branch on every tile, does

00:48:51.735 --> 00:48:53.730
that mean I'm redoing
too much computation

00:48:53.730 --> 00:48:55.090
on every other tile?

00:48:55.090 --> 00:48:57.960
So control coordination is
actually quite an interesting

00:48:57.960 --> 00:49:00.030
aspect of--

00:49:00.030 --> 00:49:01.800
adds another interesting
aspect to the

00:49:01.800 --> 00:49:04.600
parallelization for raw.

00:49:04.600 --> 00:49:07.830
So what you have to
do is figure out--

00:49:07.830 --> 00:49:09.650
there are two different
ways you can do it.

00:49:09.650 --> 00:49:14.750
Because you have no mechanism
for a global message on raw,

00:49:14.750 --> 00:49:16.940
you can't say, I've taken a
branch, everybody go to this

00:49:16.940 --> 00:49:17.970
program counter.

00:49:17.970 --> 00:49:21.690
You essentially have to send
either the branch result so

00:49:21.690 --> 00:49:24.200
one tile can do the comparison,
it calculates the

00:49:24.200 --> 00:49:29.490
condition, and then it has to
communicate it to each of the

00:49:29.490 --> 00:49:32.200
different branches-- to each
of the different tiles.

00:49:32.200 --> 00:49:34.900
Or every tile has to
essentially just replicate the

00:49:34.900 --> 00:49:37.040
control and redo the
computations.

00:49:37.040 --> 00:49:40.450
So every tile figures out what
is the condition, what are the

00:49:40.450 --> 00:49:42.700
conditions for the branch.

00:49:42.700 --> 00:49:45.130
They redundantly do that
computation and then they can

00:49:45.130 --> 00:49:47.770
all merge at the same time--

00:49:47.770 --> 00:49:49.530
at different times.

00:49:49.530 --> 00:49:52.180
So that gives you two ways
of doing the branching.
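
In sketch form, with the same hypothetical send/receive primitives as before, the two options look like this:

    extern void send_to(int tile, int v);   /* hypothetical, blocking */
    extern int  recv_from(int tile);        /* hypothetical, blocking */
    extern int  my_tile(void);              /* hypothetical runtime query */

    #define NTILES 16

    /* Option 1: one tile evaluates the condition and ships the
       result point to point; nearer tiles branch earlier in time. */
    int branch_by_message(int a, int b)
    {
        if (my_tile() == 0) {
            int taken = a < b;
            for (int t = 1; t < NTILES; t++)
                send_to(t, taken);
            return taken;
        }
        return recv_from(0);
    }

    /* Option 2: every tile redundantly recomputes the condition,
       assuming the operands are already replicated everywhere. */
    int branch_by_replication(int a, int b)
    {
        return a < b;
    }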

00:49:52.180 --> 00:49:56.720
If each tile's doing its own
control flow calculation, then

00:49:56.720 --> 00:49:58.560
they can essentially branch
at different times.

00:49:58.560 --> 00:50:00.790
But if they're all going to
wait for the result to

00:50:00.790 --> 00:50:02.730
compare, then it essentially
gives you points where you

00:50:02.730 --> 00:50:04.320
have to synchronize.

00:50:04.320 --> 00:50:06.510
Everybody's going to wait for
the result of the branch.

00:50:06.510 --> 00:50:08.320
But the latency could
be different.

00:50:08.320 --> 00:50:10.670
Because if I'm sending the
branch condition to one tile

00:50:10.670 --> 00:50:13.390
versus another tile, and if
one's closer than the other.

00:50:13.390 --> 00:50:16.390
Then the branch that's closer to
me-- the tile that's closer

00:50:16.390 --> 00:50:18.360
to me will take that branch
earlier in time.

00:50:18.360 --> 00:50:20.850
So you get sort of the
effect of a global

00:50:20.850 --> 00:50:23.500
asynchronous branching
in either case.

00:50:23.500 --> 00:50:27.680
Does that make sense?

00:50:27.680 --> 00:50:31.400
So, in summary, the raw
architecture is really a tiled

00:50:31.400 --> 00:50:31.510
microprocessor.

00:50:31.510 --> 00:50:36.340
It incorporates the best
elements from superscalars in

00:50:36.340 --> 00:50:39.460
terms of a really low latency
communication network between

00:50:39.460 --> 00:50:42.320
tiles which really cuts down
on the communication costs.

00:50:42.320 --> 00:50:45.250
And as we saw, and as probably
you've been learning,

00:50:45.250 --> 00:50:47.830
communication is really
an expensive part of

00:50:47.830 --> 00:50:52.530
parallelization on existing
multicore chips.

00:50:52.530 --> 00:50:55.670
And it's also getting the
scalability of multicores in

00:50:55.670 --> 00:50:58.920
terms of explicit parallelism
but also gives you implicit

00:50:58.920 --> 00:51:02.060
parallelism because the networks
are pipelined and

00:51:02.060 --> 00:51:04.040
they can give you
full control.

00:51:04.040 --> 00:51:06.560
So you're trying to get to the
point where you have a tile

00:51:06.560 --> 00:51:09.650
processor with scalar operand
network that allows you to do

00:51:09.650 --> 00:51:13.420
communication with a very low
cost. And it might be the case

00:51:13.420 --> 00:51:16.640
in the future that these chips
will especially be--

00:51:16.640 --> 00:51:18.925
more complex architectures will
sit on top of these so

00:51:18.925 --> 00:51:22.220
you'll use these as fundamental
building blocks.

00:51:22.220 --> 00:51:29.425
And there was the 80-core
multicore from Intel: there

00:51:29.425 --> 00:51:31.430
have been rumors that that might
actually be something

00:51:31.430 --> 00:51:34.770
like a graphics processor that
has something like a scalar

00:51:34.770 --> 00:51:35.770
operand network because
you could

00:51:35.770 --> 00:51:39.030
communicate with a very fast--

00:51:39.030 --> 00:51:41.010
with very low latency
between tiles.

00:51:41.010 --> 00:51:44.020
And in that article which came
out a few months ago was the

00:51:44.020 --> 00:51:47.610
first time I think that I had
seen tile architectures used

00:51:47.610 --> 00:51:49.950
in literature or in
publications.

00:51:49.950 --> 00:51:53.360
So I think you'll see more of
these kinds of design patterns

00:51:53.360 --> 00:51:57.420
appear as people scale out to
more than 2 cores, 4 cores, 8

00:51:57.420 --> 00:51:59.560
cores and so on, where you
could still communicate

00:51:59.560 --> 00:52:01.370
reasonably well with caches.

00:52:01.370 --> 00:52:03.840
And that's all I prepared
for today.

00:52:03.840 --> 00:52:05.090
Any other questions?

00:52:08.190 --> 00:52:09.650
And this is a list
of people who

00:52:09.650 --> 00:52:11.520
contributed to the raw project.

00:52:11.520 --> 00:52:13.750
A lot of students,
led by Anant and Saman.

00:52:13.750 --> 00:52:17.590
PROFESSOR AMARASINGHE:
[OBSCURED]

00:52:17.590 --> 00:52:23.050
view of what happened in our
groups and then how it relates

00:52:23.050 --> 00:52:25.070
to what you need.

00:52:25.070 --> 00:52:30.270
But this is trying to take
it to a much finer grain.

00:52:30.270 --> 00:52:33.760
Whereas in Cell, of course, the
message has to be large,

00:52:33.760 --> 00:52:36.098
you can do a lot of coarse
grain stuff.

00:52:36.098 --> 00:52:38.800
But in raw, you try to do much
more fine grain stuff.

00:52:38.800 --> 00:52:40.065
But we're going to talk
about it the next

00:52:40.065 --> 00:52:42.135
lecture on the future.

00:52:42.135 --> 00:52:43.170
[OBSCURED]

00:52:43.170 --> 00:52:46.961
AUDIENCE: [OBSCURED]

00:52:46.961 --> 00:52:49.640
Don't you need long wires
for the clock?

00:52:49.640 --> 00:52:51.230
PROFESSOR RABBAH: There's
no global clock.

00:52:51.230 --> 00:52:56.617
AUDIENCE: So you have this
network that seems to --

00:52:56.617 --> 00:53:00.326
So that the network actually
requires handshaking?

00:53:00.326 --> 00:53:00.500
Or--

00:53:00.500 --> 00:53:04.130
PROFESSOR AMARASINGHE: The way
you can do it is, in

00:53:04.130 --> 00:53:09.650
modern processors, [OBSCURED]

00:53:09.650 --> 00:53:12.300
so since there's no long wire,
you can actually carry the

00:53:12.300 --> 00:53:14.580
clock with the data.

00:53:14.580 --> 00:53:16.370
So in a globally clocked world, the
switching here would happen

00:53:16.370 --> 00:53:20.160
at the same time as the switching there.

00:53:20.160 --> 00:53:23.490
But since there's no big wire
connecting them, that's OK.

00:53:23.490 --> 00:53:27.340
So you can deal with
clock ticking.

00:53:27.340 --> 00:53:29.187
AUDIENCE: So this is
not going to be

00:53:29.187 --> 00:53:31.128
not clock drift because--

00:53:31.128 --> 00:53:32.120
PROFESSOR AMARASINGHE: Yeah,
that's clock drift.

00:53:32.120 --> 00:53:37.320
The clock at one end of the processor
is not happening at the same global

00:53:37.320 --> 00:53:38.570
instant as at the other
end of the processor.

00:53:48.130 --> 00:53:52.950
And since the wires also kind
of go in a tree, you can

00:53:52.950 --> 00:53:53.360
deal with that.

00:53:53.360 --> 00:53:55.276
AUDIENCE: Drift meaning
ticking at

00:53:55.276 --> 00:53:56.180
different rates, not just--

00:53:56.180 --> 00:53:58.363
PROFESSOR AMARASINGHE:
Yeah, I know.

00:53:58.363 --> 00:53:59.410
Basically I don't think
I can go back to it.

00:53:59.410 --> 00:54:00.740
It has a skew.

00:54:00.740 --> 00:54:05.090
There's a clock skew going
in between those.

00:54:05.090 --> 00:54:07.486
AUDIENCE: So you don't need
synchronizers between the

00:54:07.486 --> 00:54:07.640
different tiles?

00:54:07.640 --> 00:54:08.870
PROFESSOR AMARASINGHE: No, we
don't need synchronizers

00:54:08.870 --> 00:54:09.640
because tiles are local.

00:54:09.640 --> 00:54:11.400
The clock would bring
those tiles.

00:54:11.400 --> 00:54:14.210
The clock would bring two things
that communicate close

00:54:14.210 --> 00:54:17.100
enough that it fits
in the cycle.

00:54:17.100 --> 00:54:20.335
But for example, if you get to
two very far away branches of

00:54:20.335 --> 00:54:23.169
the tree and then try to
communicate between them, then you

00:54:23.169 --> 00:54:23.450
have a problem.

00:54:23.450 --> 00:54:27.683
Another thing is when the tree
goes here, you want to use two

00:54:27.683 --> 00:54:30.170
different branches it's
similar to going down.

00:54:30.170 --> 00:54:31.630
So you can compress
the process.

00:54:31.630 --> 00:54:32.810
So there are all these things.

00:54:32.810 --> 00:54:34.060
I mean, modern processors
really really destable.

00:54:37.310 --> 00:54:40.538
The problem occurs when you try
to connect directly from

00:54:40.538 --> 00:54:44.270
the far end of the branch to
something that gets clocked

00:54:44.270 --> 00:54:48.265
there to something that
clocks at a very

00:54:48.265 --> 00:54:48.613
early end of the branch.

00:54:48.613 --> 00:54:50.070
If you're trying to connect
those two, then the skew might

00:54:50.070 --> 00:54:51.150
be too large.

00:54:51.150 --> 00:54:53.042
Then you can get into
clock trouble.

00:54:53.042 --> 00:54:53.770
AUDIENCE: [OBSCURED]

00:54:53.770 --> 00:54:57.283
I was just worried about
this local network.

00:54:57.283 --> 00:55:04.224
[OBSCURED]

00:55:04.224 --> 00:55:11.386
AUDIENCE: Another question I had
was in the mesh, obviously

00:55:11.386 --> 00:55:15.897
the processors in the middle
have further to get to the I/O

00:55:15.897 --> 00:55:18.860
devices or to the main memory.

00:55:18.860 --> 00:55:22.641
What do you see happening as you
get to larger and larger

00:55:22.641 --> 00:55:23.144
processors?

00:55:23.144 --> 00:55:25.282
Are they going to just put more
and more local memory on

00:55:25.282 --> 00:55:26.858
the tile and [OBSCURED]

00:55:26.858 --> 00:55:30.500
it, or are they going to add
extra memory buses on it?

00:55:30.500 --> 00:55:32.225
PROFESSOR RABBAH: It could
be a combination of both.

00:55:32.225 --> 00:55:35.950
So it's not just memory;
it's I/O devices, too.

00:55:35.950 --> 00:55:38.370
If you're doing I/O then you
might want to be placed at a part

00:55:38.370 --> 00:55:42.165
of the chip that has direct
access to an I/O device or

00:55:42.165 --> 00:55:43.550
very close.

00:55:43.550 --> 00:55:46.600
It also comes up in the case
of the communication

00:55:46.600 --> 00:55:47.470
orchestration.

00:55:47.470 --> 00:55:51.680
So if this guy is doing the
branch then you want him

00:55:51.680 --> 00:55:53.270
essentially centrally located.

00:55:53.270 --> 00:55:56.150
So the best pattern for
allocating things is

00:55:56.150 --> 00:55:57.020
essentially a cross.

00:55:57.020 --> 00:55:59.670
It's like a plus sign where
it branches in the middle.

00:55:59.670 --> 00:56:02.470
PROFESSOR AMARASINGHE: But
that's not [OBSCURED].

00:56:02.470 --> 00:56:07.420
You can make them uniform by
making everybody equally far away.

00:56:07.420 --> 00:56:11.624
And a lot of times people have
done that simple model with

00:56:11.624 --> 00:56:16.353
everybody equally far away. Or you
try to take advantage of

00:56:16.353 --> 00:56:16.670
closeness and stuff like that.

00:56:16.670 --> 00:56:16.950
So you can't have it both ways.

00:56:16.950 --> 00:56:19.730
So anytime you try to
make me [OBSCURED]

00:56:19.730 --> 00:56:24.580
very, very close and fast
access, you're doing it by

00:56:24.580 --> 00:56:30.652
basically making the other
parts have fewer resources

00:56:30.652 --> 00:56:32.240
and less access.

00:56:32.240 --> 00:56:34.760
On the other hand, there are
a lot of people working on

00:56:34.760 --> 00:56:38.690
[INAUDIBLE]

00:56:38.690 --> 00:56:41.655
things that, for example,
there's a thing called free-

00:56:41.655 --> 00:56:42.990
space laser.

00:56:42.990 --> 00:56:45.920
So what that does is you put a
mirror on top of the tile, on

00:56:45.920 --> 00:56:48.500
top of the processor.

00:56:48.500 --> 00:56:58.920
And each of these-- you can
embed a small LED transmitter

00:56:58.920 --> 00:56:59.490
into the chip.

00:56:59.490 --> 00:57:01.435
So basically if you want to
communicate with someone, you

00:57:01.435 --> 00:57:03.740
just bounce that laser off
the mirror and get it

00:57:03.740 --> 00:57:04.660
to the right guy.

00:57:04.660 --> 00:57:07.100
So there are a lot of exotic
things that might be able to

00:57:07.100 --> 00:57:09.150
solve this
technological problem.

00:57:09.150 --> 00:57:11.160
But in some cases,
the speed of light--

00:57:11.160 --> 00:57:14.860
I don't think an engineer
has figured out how to

00:57:14.860 --> 00:57:15.430
break the speed of light.

00:57:15.430 --> 00:57:17.925
Unless, of course, people go
with quantum computing and

00:57:17.925 --> 00:57:18.870
stuff like that.

00:57:18.870 --> 00:57:21.930
So, I mean the key thing is, you
have resources, you have

00:57:21.930 --> 00:57:22.660
certain data and you just
have to deal with it.

00:57:22.660 --> 00:57:26.190
Getting nice uniformity
has a cost.

00:57:26.190 --> 00:57:27.340
PROFESSOR RABBAH: Yeah, I
mean, on the [OBSCURED]

00:57:27.340 --> 00:57:30.650
there are groups here at MIT
who are working on optical

00:57:30.650 --> 00:57:32.210
networks in the third
dimension.

00:57:32.210 --> 00:57:33.956
So you have a tile chip plus
an optical network in the

00:57:33.956 --> 00:57:35.990
third dimension which allows
you to do things like

00:57:35.990 --> 00:57:38.214
broadcast much more
efficiently.

00:57:38.214 --> 00:57:38.752
OK?

00:57:38.752 --> 00:57:40.300
PROFESSOR AMARASINGHE: I guess
we'll take a break here and

00:57:40.300 --> 00:57:42.286
take a small, three-minute break
and then we can go on to

00:57:42.286 --> 00:57:43.536
the next topic.