1
00:00:00,770 --> 00:00:05,600
Okay, one more cache design decision to make,
then we're done!

2
00:00:05,600 --> 00:00:09,130
How should we handle memory writes in the
cache?

3
00:00:09,130 --> 00:00:15,640
Ultimately we'll need update main memory with
the new data, but when should that happen?

4
00:00:15,640 --> 00:00:18,290
The most obvious choice is to perform the
write immediately.

5
00:00:18,290 --> 00:00:24,590
In other words, whenever the CPU sends a write
request to the cache, the cache then performs

6
00:00:24,590 --> 00:00:27,280
the same write to main memory.

7
00:00:27,280 --> 00:00:30,050
This is called "write-through".

8
00:00:30,050 --> 00:00:34,640
That way main memory always has the most up-to-date
value for all locations.

9
00:00:34,640 --> 00:00:40,140
But this can be slow if the CPU has to wait
for a DRAM write access - writes could become

10
00:00:40,140 --> 00:00:42,440
a real bottleneck!

11
00:00:42,440 --> 00:00:48,120
And what if the program is constantly writing
a particular memory location, e.g., updating

12
00:00:48,120 --> 00:00:52,260
the value of a local variable in the current
stack frame?

13
00:00:52,260 --> 00:00:56,880
In the end we only need to write the last
value to main memory.

14
00:00:56,880 --> 00:01:00,960
Writing all the earlier values is waste of
memory bandwidth.

15
00:01:00,960 --> 00:01:05,649
Suppose we let the CPU continue execution
while the cache waits for the write to main

16
00:01:05,649 --> 00:01:09,729
memory to complete - this is called "write-behind".

17
00:01:09,729 --> 00:01:14,179
This will overlap execution of the program
with the slow writes to main memory.

18
00:01:14,179 --> 00:01:19,549
Of course, if there's another cache miss while
the write is still pending, everything will

19
00:01:19,549 --> 00:01:24,090
have to wait at that point until both the
write and subsequent refill read finish, since

20
00:01:24,090 --> 00:01:29,929
the CPU can't proceed until the cache miss
is resolved.

21
00:01:29,929 --> 00:01:34,479
The best strategy is called "write-back" where
the contents of the cache are updated and

22
00:01:34,479 --> 00:01:36,561
the CPU continues execution immediately.

23
00:01:36,561 --> 00:01:43,060
The updated cache value is only written to
main memory when the cache line is chosen

24
00:01:43,060 --> 00:01:47,029
as the replacement line for a cache miss.

25
00:01:47,029 --> 00:01:52,189
This strategy minimizes the number of accesses
to main memory, preserving the memory bandwidth

26
00:01:52,189 --> 00:01:54,549
for other operations.

27
00:01:54,549 --> 00:01:58,810
This is the strategy used by most modern processors.

28
00:01:58,810 --> 00:02:00,450
Write-back is easy to implement.

29
00:02:00,450 --> 00:02:05,289
Returning to our original cache recipe, we
simply eliminate the start of the write to

30
00:02:05,289 --> 00:02:08,348
main memory when there's a write request to
the cache.

31
00:02:08,348 --> 00:02:12,220
We just update the cache contents and leave
it at that.

32
00:02:12,220 --> 00:02:17,790
However, replacing a cache line becomes a
more complex operation, since we can't reuse

33
00:02:17,790 --> 00:02:22,780
the cache line without first writing its contents
back to main memory in case they had been

34
00:02:22,780 --> 00:02:26,180
modified by an earlier write access.

35
00:02:26,180 --> 00:02:27,450
Hmm.

36
00:02:27,450 --> 00:02:31,890
Seems like this does a write-back of all replaced
cache lines whether or not they've been written

37
00:02:31,890 --> 00:02:34,470
to.

38
00:02:34,470 --> 00:02:39,500
We can avoid unnecessary write-backs by adding
another state bit to each cache line: the

39
00:02:39,500 --> 00:02:41,329
"dirty" bit.

40
00:02:41,329 --> 00:02:47,480
The dirty bit is set to 0 when a cache line
is filled during a cache miss.

41
00:02:47,480 --> 00:02:52,380
If a subsequent write operation changes the
data in a cache line, the dirty bit is set

42
00:02:52,380 --> 00:02:58,380
to 1, indicating that value in the cache now
differs from the value in main memory.

43
00:02:58,380 --> 00:03:03,080
When a cache line is selected for replacement,
we only need to write its data back to main

44
00:03:03,080 --> 00:03:06,220
memory if its dirty bit is 1.

45
00:03:06,220 --> 00:03:11,150
So a write-back strategy with a dirty bit
gives an elegant solution that minimizes the

46
00:03:11,150 --> 00:03:17,360
number of writes to main memory and only delays
the CPU on a cache miss if a dirty cache line

47
00:03:17,360 --> 00:03:20,860
needs to be written back to memory.

48
00:03:20,860 --> 00:03:25,780
That concludes our discussion of caches, which
was motivated by our desire to minimize the

49
00:03:25,780 --> 00:03:31,980
average memory access time by building a hierarchical
memory system that had both low latency and

50
00:03:31,980 --> 00:03:34,510
high capacity.

51
00:03:34,510 --> 00:03:39,000
There were a number of strategies we employed
to achieve our goal.

52
00:03:39,000 --> 00:03:45,920
Increasing the number of cache lines decreases
AMAT by decreasing the miss ratio.

53
00:03:45,920 --> 00:03:50,650
Increasing the block size of the cache let
us take advantage of the fast column accesses

54
00:03:50,650 --> 00:03:56,400
in a DRAM to efficiently load a whole block
of data on a cache miss.

55
00:03:56,400 --> 00:04:01,890
The expectation was that this would improve
AMAT by increasing the number of hits in the

56
00:04:01,890 --> 00:04:07,300
future as accesses were made to nearby locations.

57
00:04:07,300 --> 00:04:12,190
Increasing the number of ways in the cache
reduced the possibility of cache line conflicts,

58
00:04:12,190 --> 00:04:14,500
lowering the miss ratio.

59
00:04:14,500 --> 00:04:19,548
Choosing the least-recently used cache line
for replacement minimized the impact of replacement

60
00:04:19,548 --> 00:04:21,009
on the hit ratio.

61
00:04:21,009 --> 00:04:27,430
And, finally, we chose to handle writes using
a write-back strategy with dirty bits.

62
00:04:27,430 --> 00:04:31,069
How do we make the tradeoffs among all these
architectural choices?

63
00:04:31,069 --> 00:04:38,009
As usual, we'll simulate different cache organizations
and chose the architectural mix that provides

64
00:04:38,009 --> 00:04:40,629
the best performance on our benchmark programs.