For some applications, data naturally comes in vector or matrix form. For example, a vector of digitized samples representing an audio waveform over time, or a matrix of pixel colors in a 2D image from a camera. When processing that data, it's common to perform the same sequence of operations on each data element. The example code shown here is computing a vector sum, where each component of one vector is added to the corresponding component of another vector.

By replicating the datapath portion of our CPU, we can design special-purpose vector processors capable of performing the same operation on many data elements in parallel. Here we see that the register file and ALU have been replicated, and the control signals from decoding the current instruction are shared by all the datapaths. Data is fetched from memory in big blocks (very much like fetching a cache line), and the specified register in each datapath is loaded with one of the words from the block. Similarly, each datapath can contribute a word to be stored as a contiguous block in main memory. In such machines, the width of the data buses to and from main memory is many words wide, so a single memory access provides data for all the datapaths in parallel.

Executing a single instruction on a machine with N datapaths is equivalent to executing N instructions on a conventional machine with a single datapath. The result achieves a lot of parallelism without the complexities of out-of-order superscalar execution.

Suppose we had a vector processor with 16 datapaths. Let's compare its performance on a vector-sum operation to that of a conventional pipelined Beta processor. Here's the Beta code, carefully organized to avoid any data hazards during execution. There are 9 instructions in the loop, taking 10 cycles to execute if we count the NOP introduced into the pipeline when the BNE at the end of the loop is taken. It takes 160 cycles to sum all 16 elements, assuming no additional cycles are required due to cache misses.

And here's the corresponding code for a vector processor, where we've assumed constant-sized 16-element vectors. Note that the "V" registers refer to a particular location in the register file associated with each datapath, while the "R" registers are the conventional Beta registers used for address computations, etc.
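The slide's source code isn't reproduced in this transcript, but the operation being described is just an element-by-element vector sum. Here is a minimal C sketch of the scalar loop that both the pipelined Beta code and the 16-wide vector code implement; the names a, b, c, and n are illustrative rather than taken from the slide.

```c
#include <stddef.h>

/* Scalar vector sum: one addition per loop iteration.
   On the 16-datapath vector machine described above, a single
   vector add performs 16 of these iterations at once, and one
   wide memory access loads or stores 16 elements as a block. */
void vector_sum(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   /* same operation applied to every element */
    }
}
```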
It would only take 4 cycles for the vector processor to complete the desired operations, a speed-up of 40. This example shows the best-possible speed-up. The key to a good speed-up is our ability to "vectorize" the code and take advantage of all the datapaths operating in parallel. This isn't possible for every application, but for tasks like audio or video encoding and decoding, and all sorts of digital signal processing, vectorization is very doable. Memory operations enjoy a similar performance improvement since the access overhead is amortized over large blocks of data.

You might wonder if it's possible to efficiently perform data-dependent operations on a vector processor. Data-dependent operations appear as conditional statements on conventional machines, where the body of the statement is executed if the condition is true. If testing and branching are under the control of the single-instruction execution engine, how can we take advantage of the parallel datapaths?

The trick is to provide each datapath with a local predicate flag. Use a vectorized compare instruction (CMPLT.V) to perform the a[i] < b[i] comparisons in parallel and remember the result locally in each datapath's predicate flag. Then extend the vector ISA to include "predicated instructions" which check the local predicate to see if they should execute or do nothing. In this example, ADDC.V.iftrue only performs the ADDC on the local data if the local predicate flag is true.

Instruction predication is also used in many non-vector architectures to avoid the execution-time penalties associated with mis-predicted conditional branches. Predicated instructions are particularly useful for simple arithmetic and boolean operations (i.e., very short instruction sequences) that should be executed only if a condition is met. The x86 ISA includes a conditional move instruction, and in the 32-bit ARM ISA almost all instructions can be conditionally executed.

The power of vector processors comes from having one instruction initiate N parallel operations on N pairs of operands. Most modern CPUs incorporate vector extensions that operate in parallel on 8-, 16-, 32-, or 64-bit operands organized as 128-, 256-, or 512-bit blocks of data. Often all that's needed is some simple additional logic on an ALU designed to process full-width operands.
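To make the predication idea concrete, here is a hedged C sketch of what the vectorized compare and the predicated add do, with one array slot standing in for each datapath's local state. The LANES constant, the pred array, and the function name are illustrative; CMPLT.V and ADDC.V.iftrue are the hypothetical vector-Beta mnemonics used in the lecture.

```c
#include <stdbool.h>

#define LANES 16   /* one software "lane" per hardware datapath (illustrative) */

/* Per-lane predication:
     pred[i] <- (a[i] < b[i])         -- CMPLT.V-style parallel compare
     if pred[i]: c[i] <- c[i] + k     -- ADDC.V.iftrue-style predicated add */
void predicated_addc(const int a[LANES], const int b[LANES],
                     int c[LANES], int k)
{
    bool pred[LANES];

    /* Every lane executes the compare and records its own flag. */
    for (int i = 0; i < LANES; i++)
        pred[i] = (a[i] < b[i]);

    /* Every lane sees the add instruction, but only lanes whose
       predicate is set actually update their element of c. */
    for (int i = 0; i < LANES; i++)
        if (pred[i])
            c[i] = c[i] + k;
}
```

Real SIMD extensions express the same pattern with a compare that produces a mask, followed by a masked or blended arithmetic instruction, so no per-element branch is ever taken.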
The parallelism is baked into the vector program, not discovered on-the-fly by the instruction dispatch and execution machinery. Writing the specialized vector programs is a worthwhile investment for certain library functions which see a lot of use in processing today's information streams, with their heavy use of images and A/V material.

Perhaps the best example of an architecture with many datapaths operating in parallel is the graphics processing unit (GPU) found in almost all computer graphics systems. GPU datapaths are typically specialized for the 32- and 64-bit floating-point operations found in the algorithms needed to display, in real time, a 3D scene represented as billions of triangular patches as a 2D image on the computer screen. Coordinate transformation, pixel shading and antialiasing, texture mapping, etc., are examples of "embarrassingly parallel" computations, where the parallelism comes from having to perform the same computation independently on millions of different data objects. Similar problems can be found in the fields of bioinformatics, big-data processing, neural-net emulation used in deep machine learning, and so on. Increasingly, GPUs are used in many interesting scientific and engineering calculations and not just as graphics engines.

Data-level parallelism provides significant performance improvements in a variety of useful situations. So current and future ISAs will almost certainly include support for vector operations.