6.033 | Spring 2018 | Undergraduate

Computer System Engineering

Week 8: Distributed Systems Part I

Lecture 14: Fault Tolerance: Reliability via Replication

Lecture 14 Outline

  1. Introduction
  2. Building Fault-Tolerant Systems
  3. Quantifying Reliability
  4. Reliability via Replication
  5. Dealing with Disk Failures
  6. Whole-Disk Failures
  7. Your Future

Lecture Slides

Reading

  • Book sections 8.1, 8.2, and 8.3

Recitation 14: Distributed Storage

Lecture 15: Fault Tolerance: Introduction to Transactions

Lecture 15 Outline

  1. Introduction
  2. Atomicity
  3. Achieving Atomicity
  4. Achieving Atomicity with Shadow Copies
  5. Making Rename Atomic
  6. Single-Sector Writes
  7. Recovering the Disk
  8. Shadow Copies: A Summary
  9. Transactions
  10. Isolation
  11. The Future

Lecture Slides

Reading

  • Book sections 9.1, 9.2.1, and 9.2.2

Recitation 15: No Recitation; Prepare for the Quiz instead

Quiz 1

Quiz 1 will last two hours. The quiz will cover all the material up to and including Recitation 13 (CDNs). The quiz will be “open book.” That means you can bring along any printed or written materials that you think might be useful. Calculators are allowed, though typically not necessary. You may also bring a laptop to view, e.g., PDF versions of papers and notes, but you may not connect to any network; make sure you download the papers to your laptop before the quiz. Charge your laptops before you come; we cannot guarantee outlet availability.

Tutorial 8: Design Project Presentation

Your presentation should reflect the feedback you got on your preliminary report; feedback on your presentation should inform your final report. Your presentation will focus on any changes you have made since the preliminary report, rather than re-capping the entire system. See the Design Project section for more information.

Design Project Presentation (DPP)

Read “The Google File System” by S. Ghemawat, H. Gobioff & S-T Leung. You’ve seen GFS before: It is the system that MapReduce relied on to replicate files.

GFS is a system that replicates files across machines. It’s meant for an environment where lots of users are writing to the files, the files are really big, and failures are common. Sections 2–4 of the paper describe the design of GFS, Section 5 discusses how GFS handles failures, and Sections 6–7 detail their evaluation and real-world usage of GFS.

To check whether you understand the design of GFS, you should be able to answer the following questions: What is the role of the master? How does a read work? How does a write work?

As you read, think about:

  • Why does GFS use a large chunk size?
  • What happens if a replica in GFS fails?
  • What happens if the master fails?

Questions for Recitation

Before you come to this recitation, write up (on paper) a brief answer to the following (really—we don’t need more than a couple sentences for each question). 

Your answers to these questions should be in your own words, not direct quotations from the paper.

  • What assumptions does GFS rely on?
  • How does it exploit those assumptions?
  • Why does GFS make those assumptions?

As always, there are multiple correct answers for each of these questions.

Lecture 14 Notes

  1. Introduction
    • Done with OSes, networking.
    • Now: How to systematically deal with failures, or build “fault-tolerant” systems.
      • We’ll allow more complicated failures and also try to recover from failures.
    • Thinking about large, distributed systems. 100s, 1000s, even more machines, potentially located across the globe.
    • Will also have to think about what these applications are doing, what they need.
  2. Building Fault-Tolerant Systems
    • General approach:
      1. Identify possible faults (software, hardware, design, operation, environment, …).
      2. Detect and contain.
      3. Handle the fault.
        • Do nothing, fail-fast (detect and report to the next higher level), fail-stop (detect and stop), mask, …
    • Caveats
      • Components are always unreliable. We aim to build a reliable system out of them, but our guarantees will be probabilistic.
      • Reliability comes at a cost; always a tradeoff. Common tradeoff is reliability vs. simplicity.
      • All of this is tricky. It’s easy to miss some possible faults in step 1, for example. Hence, we iterate.
      • We’ll have to rely on *some* code to work correctly. In practice, there is only a small portion of mission-critical code. We have stringent development processes for those components.
  3. Quantifying Reliability
    • Goal: Increase availability.
    • Metrics:
          MTTF = mean time to failure
          MTTR = mean time to repair
          availability = MTTF / (MTTF + MTTR)
    • Example: Suppose my OS crashes once every month, and takes 10 minutes to recover.
          MTTF = 30 days = 720 hours = 43,200 minutes
          MTTR = 10 minutes
          availability = 43,200 / 43,210 ≈ 0.9998
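    • A minimal Python sketch of the same calculation (the variable names are illustrative, not from the slides):

          mttf_minutes = 30 * 24 * 60   # crashes once a month => MTTF = 43,200 minutes
          mttr_minutes = 10             # 10 minutes to recover

          availability = mttf_minutes / (mttf_minutes + mttr_minutes)
          print(f"availability = {availability:.4f}")   # ~0.9998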
  4. Reliability via Replication
    • To improve reliability, add redundancy.
    • One way to add redundancy: replication.
    • Today: Replication within a single machine to deal with disk failures.
      • Tomorrow in recitation: replication across machines to deal with machine failures.
  5. Dealing with Disk Failures
    • Why disks?
      • Starting from a single machine because we want to improve reliability there first, before we move to multiple machines.
      • Disks in particular because if a disk fails, your data is gone; other components, like the CPU, can be replaced easily. The cost of a disk failure is high.
    • Are disk failures frequent?
      • Manufacturers claim an MTBF of 700K+ hours, which is bogus.
        • Likely: Ran 1000 disks for 3000 hours (125 days) => 3 million disk-hours total, had 4 failures, and concluded: 1 failure every 750,000 hours (arithmetic sketched below).
      • But failures aren’t memoryless: A disk is more likely to fail at the beginning and end of its lifespan than in the middle (see slides).
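      • The arithmetic behind that suspect extrapolation, as a quick Python sketch (the 1000-disk / 3000-hour numbers are the illustrative guess above, not vendor data):

          disks = 1000
          hours_per_disk = 3000                       # about 125 days of testing
          total_disk_hours = disks * hours_per_disk   # 3,000,000 disk-hours
          failures = 4

          mtbf_estimate = total_disk_hours / failures
          print(mtbf_estimate)   # 750,000 hours -- valid only if failures are memoryless,
                                 # which the bathtub-shaped failure curve says they aren't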
  6. Whole-Disk Failures
    • General scenario: Entire disk fails, all data on that disk is lost. What to do? RAID provides a suite of techniques.
    • RAID 1: Mirror data across 2 disks.
      • Pro: Can handle single-disk failures.
      • Pro: Performance improvement on reads (issue two in parallel), not a terrible performance hit on writes (have to issue two writes, but you can issue them in parallel too).
      • Con: To mirror N disks’ worth of data, you need 2N disks.
    • RAID 4: With N data disks, add one additional parity disk. Sector i on the parity disk is the XOR of sector i from each of the data disks (see the sketch at the end of this section).
      • Pro: Can handle single-disk failures (if one disk fails, xor the other disks to recover its data).
        • Can use same technique to recover from single-sector errors.
      • Pro: To store N disks’ worth of data we only need N+1 disks.
      • Pro: Improved performance if you stripe files across the array. E.g., an N-sector-length file can be stored as one sector per disk. Reading the whole file means N parallel 1-sector reads instead of 1 long N-sector read.
        • RAID is a system for reliability, but we never forget about performance, and in fact performance influenced much of the design of RAID.
      • Con: Every write hits the parity disk.
    • RAID 5: Same as RAID 4, except intersperse the parity sectors amongst all N+1 disks to load balance writes.
      • You need a way to figure out which disk holds the parity sector for sector i, but it’s not hard.
    • RAID 5 is used in practice, but it is giving way to RAID 6, which uses the same techniques but protects against two disks failing at the same time.
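    • A minimal Python sketch of the parity idea referenced above (byte strings stand in for sectors; this is an illustration, not the real on-disk layout):

          def xor_sectors(sectors):
              # XOR corresponding bytes of equal-length sectors.
              out = bytearray(len(sectors[0]))
              for s in sectors:
                  for i, b in enumerate(s):
                      out[i] ^= b
              return bytes(out)

          data = [b"AAAA", b"BBBB", b"CCCC"]   # sector i from each of 3 data disks
          parity = xor_sectors(data)           # sector i on the parity disk

          # The disk holding data[1] fails: rebuild its sector from the survivors + parity.
          recovered = xor_sectors([data[0], data[2], parity])
          assert recovered == data[1]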
  7. Your Future
    • RAID, and even replication, don’t solve everything.
      • E.g., what about failures that aren’t independent?
    • Wednesday: We’ll introduce transactions, which let us make some abstractions to reason about faults.
    • Next week: We’ll get transaction-based systems to perform well on a single machine.
    • Week after: We’ll get everything to work across machines.
Lecture 15 Notes

  1. Introduction
    • Main goal: Build a reliable system out of unreliable components.
    • Last two days: Reliability via replication. GFS gave you replication to tolerate machine failures, though only by making some sweeping simplifications.
    • Replication lets us mask failures from users…
    • …But doesn’t solve all of our problems. Can’t replicate everything.
    • For failures that replication can’t handle: Need to reason about them and then deal with them. The reasoning is hard.
  2. Atomicity  
    • Starting today, we want to achieve “atomicity”: Atomic actions happen entirely or not at all. (Sometimes this is called “all-or-nothing atomicity” instead of just atomicity.)
    • Why atomicity?
      • Will enable sweeping simplifications for reasoning about fault-tolerance. Either the action happened or it didn’t; we don’t have to reason about an in-between state.
      • Will be realistic for applications.
    • Motivating example (see slides): A transfer between bank accounts that crashes partway through, after debiting one account but before crediting the other; a rough sketch of the problem appears at the end of this section.
    • What should have happened? We exposed internal state; instead, all of this should have happened at once, or not at all -> it should have been atomic.
      • Understanding that this code should be atomic comes from understanding what the application is *doing*. What actions need to be atomic depends on the application.
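    • A rough sketch of the problem, using a hypothetical in-memory transfer (the slides use similar bank-account code):

          accounts = {"A": 100, "B": 50}

          def transfer(src, dst, amount):
              accounts[src] -= amount
              # <-- a crash here exposes internal state: the money has left src
              #     but never arrives at dst
              accounts[dst] += amount

          transfer("A", "B", 20)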
  3. Achieving Atomicity  
    • Attempt 1:
      • Store a spreadsheet of account balances in a single file.
      • Load the file into memory, make updates, write back to disk when done.
      • (See Lecture 15 slides (PDF) for code; a rough sketch appears at the end of this section.)
    • If system crashes in the middle? Okay—on reload, we’ll read the file, will look as if transfer never happened.
    • If we crash halfway through writing file? Can’t handle that.
    • Golden rule: Never modify the only copy.
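    • A rough sketch of attempt 1 (hypothetical helpers; the real code is in the slides). The danger is the in-place write at the end:

          def read_accounts(path):
              with open(path) as f:
                  return {name: float(bal) for name, bal in
                          (line.split() for line in f)}

          def write_accounts(path, accounts):
              with open(path, "w") as f:          # truncates the ONLY copy...
                  for name, bal in accounts.items():
                      f.write(f"{name} {bal}\n")  # ...so a crash mid-write loses data

          accounts = read_accounts("bank_file")
          accounts["A"] -= 20
          accounts["B"] += 20
          write_accounts("bank_file", accounts)   # violates the golden rule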
  4. Achieving Atomicity with Shadow Copies  
    • New idea: Write to a “shadow copy” of the file first, then rename the file in one step (sketched at the end of this section).
    • If we crash halfway through writing the shadow copy? Still have intact old copy.
    • This makes rename a “commit point.”
      • Crash before a commit point => old values.
      • Crash after a commit point => new values.
      • Commit point itself must be atomic action.
    • But then rename itself must be atomic.
      • Why didn’t we try to make write_accounts an atomic action in the first attempt? Because it would be very hard to do; you’ll see after we make rename atomic why this was the right choice.
  5. Making Rename Atomic  
    • Shadow copies are good. How do we make rename atomic?
    • What must rename() do?
      1. Point the “bank_file” directory entry at “tmp_bankfile”’s inode.
      2. Remove the “tmp_bankfile” directory entry.
      3. Decrement the refcount on the original file’s inode.
    • (See Lecture 15 slides (PDF) for code.)
    • Is this atomic?
      • Crash before setting dirent => rename didn’t happen.
      • Crash after setting dirent => rename happened, but refcounts are wrong.
      • Crash while setting dirent => very bad; system is in an inconsistent state.
    • So two problems:
      1. Need to fix refcounts.
      2. Need to deal with a crash while setting the dirent.
  6. Single-Sector Writes  
    • Deal with the second problem by preventing it from happening: Setting the dirent involves a single-sector write, which the disk provides as an atomic action.
      • The time spent writing a sector is small (remember: high sequential speed).
      • A small capacitor suffices to power the disk for a few microseconds (this capacitor is there to “finish” the write even on a power failure).
      • If the write didn’t start, there’s no need to complete it; the action will still be atomic.
  7. Recovering the Disk  
    • Other combinations of incrementing and decrementing refcounts won’t work, mostly because even if we increment before every applicable action and decrement after, we could still crash between an action and its increment/decrement.
    • Solution: Recover the disk after a crash.
      • Clean up refcounts.
      • Delete tmp files.
      • (See Lecture 15 slides (PDF) for code.)
    • What if we crash during recover? Then just run recover again after that crash.
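    • A rough sketch of the tmp-file cleanup only (fixing refcounts is filesystem-level work; see the slides’ recover code):

          import glob, os

          def recover(directory="."):
              # Run after every crash, before resuming normal operation.
              # Any leftover shadow copy was never committed, so it is safe to delete.
              for tmp in glob.glob(os.path.join(directory, "*.tmp")):
                  os.remove(tmp)
              # Crashing during recover is fine: the next run just finds
              # fewer (or no) leftover tmp files.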
  8. Shadow Copies: A Summary  
    • Shadow copies work because they perform updates/changes on a copy and then install the new copy with an atomic operation (in this case, a single-sector write).
    • But they don’t perform well
      • Hard to generalize to multiple files/directories
      • Require copying the entire file for even small changes
      • Haven’t even dealt with concurrency
  9. Transactions
    • Speaking of concurrency: We’re going to want another thing.
    • Isolation
      • Run A and B concurrently, and it appears as if A ran before B or vice versa.
    • Transactions: Powerful abstraction that provides atomicity *and* isolation.
      • We’ve been trying to figure out how to provide atomicity. Our larger goal is to provide both abstractions, so that we can get transactions to actually work.
    • Examples:

          T1                     T2
          begin                  begin
          transfer(A, B, 20)     transfer(B, C, 5)
          withdraw(B, 10)        deposit(A, 5)
          end                    end

      • “begin” and “end” define the start and end of a transaction.
      • These transactions are using higher-level steps than before. T1, e.g., wants the transfer *and* the withdraw to be a single atomic action.
  10. Isolation
    • Isn’t isolation obvious? Can’t we just put locks everywhere?
    • Not exactly:
      • This solution may be correct, but will perform poorly. Very poorly for the scale of systems that we’re dealing with now.
        • We don’t *really* want only one thread to be in the transfer() code at once.
      • Locks inside transfer() won’t provide isolation for T1 and T2; those would need higher-level locks.
      • Locks are hard to get right: require global reasoning.
    • So we don’t know how to provide isolation yet. But we’ll get there eventually. We will *use* locks, just in a much more systematic way than you’ve seen until now, and in a way that performs well.
  11. The Future
    • What we’re really trying to do is implement transactions. Right now we have a not-very-good way to provide atomicity, and no way to provide isolation (unless you count a single, global lock).
    • Next week: We’re going to get atomicity and isolation working well on a single machine.
    • Week after: We’re going to get transaction-based systems to run across multiple machines.
