For some applications, data naturally comes in vector or matrix form. For example, a vector of digitized samples representing an audio waveform over time, or a matrix of pixel colors in a 2D image from a camera. When processing that data, it's common to perform the same sequence of operations on each data element. The example code shown here is computing a vector sum, where each component of one vector is added to the corresponding component of another vector.

By replicating the datapath portion of our CPU, we can design special-purpose vector processors capable of performing the same operation on many data elements in parallel. Here we see that the register file and ALU have been replicated, and the control signals from decoding the current instruction are shared by all the datapaths. Data is fetched from memory in big blocks (very much like fetching a cache line), and the specified register in each datapath is loaded with one of the words from the block. Similarly, each datapath can contribute a word to be stored as a contiguous block in main memory. In such machines, the width of the data buses to and from main memory is many words wide, so a single memory access provides data for all the datapaths in parallel.

Executing a single instruction on a machine with N datapaths is equivalent to executing N instructions on a conventional machine with a single datapath. The result achieves a lot of parallelism without the complexities of out-of-order superscalar execution.

Suppose we had a vector processor with 16 datapaths. Let's compare its performance on a vector-sum operation to that of a conventional pipelined Beta processor. Here's the Beta code, carefully organized to avoid any data hazards during execution. There are 9 instructions in the loop, taking 10 cycles to execute if we count the NOP introduced into the pipeline when the BNE at the end of the loop is taken. It takes 160 cycles to sum all 16 elements, assuming no additional cycles are required due to cache misses.

And here's the corresponding code for a vector processor, where we've assumed constant-sized 16-element vectors. Note that the "V" registers refer to a particular location in the register file associated with each datapath, while the "R" registers are the conventional Beta registers used for address computations, etc.
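The slide's source code isn't reproduced in this transcript, but the operation being described is just an element-by-element vector sum. Here is a minimal C sketch of the scalar loop that both the pipelined Beta code and the 16-wide vector code implement; the names a, b, c, and n are illustrative rather than taken from the slide.

```c
#include <stddef.h>

/* Scalar vector sum: one addition per loop iteration.
   On the 16-datapath vector machine described above, a single
   vector add performs 16 of these iterations at once, and one
   wide memory access loads or stores 16 elements as a block. */
void vector_sum(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   /* same operation applied to every element */
    }
}
```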
It would only take 4 cycles for the vector processor to complete the desired operations, a speed-up of 40. This example shows the best-possible speed-up. The key to a good speed-up is our ability to "vectorize" the code and take advantage of all the datapaths operating in parallel. This isn't possible for every application, but for tasks like audio or video encoding and decoding, and all sorts of digital signal processing, vectorization is very doable. Memory operations enjoy a similar performance improvement since the access overhead is amortized over large blocks of data.

You might wonder if it's possible to efficiently perform data-dependent operations on a vector processor. Data-dependent operations appear as conditional statements on conventional machines, where the body of the statement is executed if the condition is true. If testing and branching are under the control of the single-instruction execution engine, how can we take advantage of the parallel datapaths?

The trick is to provide each datapath with a local predicate flag. Use a vectorized compare instruction (CMPLT.V) to perform the a[i] < b[i] comparisons in parallel and remember the result locally in each datapath's predicate flag. Then extend the vector ISA to include "predicated instructions" which check the local predicate to see if they should execute or do nothing. In this example, ADDC.V.iftrue only performs the ADDC on the local data if the local predicate flag is true.

Instruction predication is also used in many non-vector architectures to avoid the execution-time penalties associated with mis-predicted conditional branches. Predicated instructions are particularly useful for simple arithmetic and boolean operations (i.e., very short instruction sequences) that should be executed only if a condition is met. The x86 ISA includes a conditional move instruction, and in the 32-bit ARM ISA almost all instructions can be conditionally executed.

The power of vector processors comes from having one instruction initiate N parallel operations on N pairs of operands. Most modern CPUs incorporate vector extensions that operate in parallel on 8-, 16-, 32-, or 64-bit operands organized as 128-, 256-, or 512-bit blocks of data. Often all that's needed is some simple additional logic on an ALU designed to process full-width operands.
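To make the predication idea concrete, here is a hedged C sketch of what the vectorized compare and the predicated add do, with one array slot standing in for each datapath's local state. The LANES constant, the pred array, and the function name are illustrative; CMPLT.V and ADDC.V.iftrue are the hypothetical vector-Beta mnemonics used in the lecture.

```c
#include <stdbool.h>

#define LANES 16   /* one software "lane" per hardware datapath (illustrative) */

/* Per-lane predication:
     pred[i] <- (a[i] < b[i])         -- CMPLT.V-style parallel compare
     if pred[i]: c[i] <- c[i] + k     -- ADDC.V.iftrue-style predicated add */
void predicated_addc(const int a[LANES], const int b[LANES],
                     int c[LANES], int k)
{
    bool pred[LANES];

    /* Every lane executes the compare and records its own flag. */
    for (int i = 0; i < LANES; i++)
        pred[i] = (a[i] < b[i]);

    /* Every lane sees the add instruction, but only lanes whose
       predicate is set actually update their element of c. */
    for (int i = 0; i < LANES; i++)
        if (pred[i])
            c[i] = c[i] + k;
}
```

Real SIMD extensions express the same pattern with a compare that produces a mask, followed by a masked or blended arithmetic instruction, so no per-element branch is ever taken.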
The parallelism is baked into the vector program, not discovered on-the-fly by the instruction dispatch and execution machinery. Writing the specialized vector programs is a worthwhile investment for certain library functions which see a lot of use in processing today's information streams, with their heavy use of images and A/V material.

Perhaps the best example of an architecture with many datapaths operating in parallel is the graphics processing unit (GPU) found in almost all computer graphics systems. GPU datapaths are typically specialized for the 32- and 64-bit floating-point operations found in the algorithms needed to display, in real time, a 3D scene represented as billions of triangular patches as a 2D image on the computer screen. Coordinate transformation, pixel shading and antialiasing, texture mapping, etc., are examples of "embarrassingly parallel" computations, where the parallelism comes from having to perform the same computation independently on millions of different data objects. Similar problems can be found in the fields of bioinformatics, big-data processing, neural-net emulation used in deep machine learning, and so on. Increasingly, GPUs are used in many interesting scientific and engineering calculations and not just as graphics engines.

Data-level parallelism provides significant performance improvements in a variety of useful situations. So current and future ISAs will almost certainly include support for vector operations.