6.033 | Spring 2018 | Undergraduate

Computer System Engineering

Week 11: Security Part I

Lecture 19: Availability via Replication

Lecture 19 Outline

  1. Introduction
  2. The Problem
  3. Replicated State Machines (RSMs)
  4. Primary/Backup Model
  5. View Servers
  6. View Servers in the Face of Network Partitions
  7. Recruiting New Backups
  8. Dealing with Centralization

Lecture Slides

Reading

  • No assigned reading

Recitation 19: Raft

Lecture 20: Introduction to Security

Lecture 20 Outline

  1. Introduction
  2. Computer Security vs. General Security
  3. Difficulties of Computer Security
  4. Modeling Security
  5. Guard Model
  6. What Can Go Wrong?

Lecture Slides

Reading

  • Book section 11.1

Recitation 20: [No Recitation]

Tutorial 11: Design Project Peer Review

The focus of Tutorial 11 is peer review. You will read portions of another team’s design project report (DPR) and offer feedback and insight. First, complete the Preparation for Tutorial 11 (PDF). Then, take a look at the Peer Review Assignment (PDF). See the Design Project section for more information.

Design Project (DP) Peer Review Assignment

  1. Introduction
    • Have multi-site atomicity; want to improve availability via replication.
    • Goal: Single-copy consistency.
      • Property of the externally-visible behavior of a replicated system.
      • Operations appear to execute as if there’s only a single copy of the data.
    • Is single-copy consistency always needed? No: DNS and PNUTS don’t use it. But some systems do need it.
  2. The Problem
    • Main problem: Messages can arrive at replicas in different orders resulting in inconsistent state.
      • See slides: C1 and C2 issue two writes, arrive in different orders.
    • Note that the network causes this problem. Network will be problematic in general.
  3. Replicated State Machines (RSMs)
    • RSMs ensure that each replica ends up with same final state by:
      • Starting with the same initial state on each server.
      • Providing each replica with the same input operations, in the same order.
      • Ensuring that all operations are deterministic (e.g., no randomness, no reading of current time, etc.).
    • Assumption: Failures are independent.
      • Not always true!
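The RSM recipe above can be sketched in a few lines of Python (a hypothetical, illustrative simulation, not a real replication library): two replicas start from the same initial state and apply the same deterministic operation log in the same order, so they necessarily end in the same state.

```python
# Illustrative sketch of the RSM idea. The operation format is made up.
class Replica:
    def __init__(self):
        self.state = {}  # same initial state on every server

    def apply(self, op):
        # Ops must be deterministic: no randomness, no reading the clock.
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "incr":
            self.state[key] = self.state.get(key, 0) + value

log = [("put", "x", 1), ("incr", "x", 2), ("put", "y", 9)]

r1, r2 = Replica(), Replica()
for op in log:          # same input operations, in the same order
    r1.apply(op)
    r2.apply(op)

assert r1.state == r2.state == {"x": 3, "y": 9}
```

If any operation were non-deterministic (say, one that stored the current time), the two replicas could diverge even though they saw the same log.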
  4. Primary/Backup Model
    • [Clients] –> Coordinator –> Primary server –> Backup server
    • Primary does important stuff:
      • Ensures that it sends all updates to the backup before ACKing the coordinator.
      • Chooses an ordering for all operations, so that the primary and backup agree.
      • Decides all non-deterministic values (e.g., random(), time()).
    • What if primary fails?
      • Idea 1: Coordinator knows about both primary and backup, and decides which to use.
        • Won’t work: Split brain syndrome. Multiple coordinators come to independent, and different, conclusions about who is primary when there are network partitions (see Lecture 19 slides (PDF)).
      • Idea 2: Have human decide when to switch from primary to backup.
        • Not unreasonable for small webservices.
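The primary’s three jobs above can be sketched as a toy simulation (all class and method names here are illustrative, not a real protocol implementation): the primary assigns each operation a sequence number, resolves any non-deterministic values itself, and only ACKs the coordinator after the backup has ACKed.

```python
# Illustrative sketch of the primary/backup flow.
import random

class Backup:
    def __init__(self):
        self.log = []

    def replicate(self, seq, op):
        # Backup records the operation exactly as the primary ordered it.
        self.log.append((seq, op))
        return "ACK"

class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.log = []
        self.seq = 0

    def handle(self, op):
        self.seq += 1                       # primary chooses the ordering
        if op == "random":                  # primary decides non-deterministic values
            op = ("random", random.random())
        if self.backup.replicate(self.seq, op) != "ACK":
            return "ERROR"                  # can't proceed without the backup
        self.log.append((self.seq, op))
        return "ACK"                        # only now ACK the coordinator

backup = Backup()
primary = Primary(backup)
assert primary.handle(("put", "x", 1)) == "ACK"
assert backup.log == primary.log            # backup saw everything the primary did
```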
  5. View Servers
    • Add view server to primary/backup model.
    • Basic functionality:
      1. The view server keeps a table that maintains a sequence of “views”. Each view contains the view number, the primary server, and the backup server.
      2. The VS alerts each server as to whether it’s the primary or the backup.
      3. Upon receiving any updates, the primary will receive an ACK from the backup before responding to the coordinator (just as before).
      4. Coordinators make requests to the VS asking who is primary. Coordinators then contact the primary.
    • To discover failures: Replicas ping the VS. If the VS misses N pings in a row from a server, it deems that server dead.
    • Basic failure (actual worker crash):
      1. Primary fails; pings cease.
      2. VS lets S2 know it’s primary, and it handles any client requests.
        • Before S2 knows it’s the primary, it will simply reject requests that come directly from coordinators.
      3. VS will eventually—hopefully quickly—recruit a new idle server to act as backup.
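The view-server bookkeeping above can be sketched as follows (a deliberately minimal simulation; the names and the constant N are illustrative, and a real VS must also wait for the new primary to acknowledge each view change):

```python
# Illustrative sketch of a view server's failure detection.
N = 3  # missed pings in a row before a server is deemed dead

class ViewServer:
    def __init__(self, primary, backup):
        self.view = (1, primary, backup)      # (view number, primary, backup)
        self.missed = {primary: 0, backup: 0}

    def ping(self, server):
        self.missed[server] = 0               # replicas ping to prove liveness

    def tick(self):
        # Called periodically; every server accrues one missed ping,
        # cleared if it pinged since the last tick.
        num, primary, backup = self.view
        for s in self.missed:
            self.missed[s] += 1
        if self.missed[primary] >= N:
            # Primary deemed dead: promote the backup; no backup yet
            # until the VS recruits a new idle server.
            self.view = (num + 1, backup, None)

    def current_view(self):
        return self.view    # coordinators ask this to find the primary

vs = ViewServer("S1", "S2")
for _ in range(N):
    vs.tick()               # S1 has stopped pinging...
    vs.ping("S2")           # ...but S2 keeps pinging
assert vs.current_view() == (2, "S2", None)
```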
  6. View Servers in the Face of Network Partitions
    • We have a few rules in place:
      1. Primary must wait for backup to accept each request.
      2. Non-primary must reject direct coordinator requests (that’s what happened in the earlier failure, in the interim between the failure and S2 hearing that it was primary).
    • Add two more:
      1. Primary must reject forwarded requests (i.e., it won’t accept an update from the backup).
      2. Primary in view i must have been primary or backup in view i-1.
    • Now consider S1 being partitioned from the VS (see Lecture 19 slides (PDF)).
      • Before S2 hears about View 2:
        • S1 can process operations from coordinators, S2 will accept forwarded requests.
        • S2 will reject operations from coordinators who have heard about view 2.
      • After S2 hears about View 2:
        • If S1 receives a coordinator request, it will forward it to S2. S2 will reject it (not ACK), so S1 can no longer act as primary.
        • S1 will send an error to the coordinator; the coordinator will ask the VS for a new view, learn about View 2, and re-send the request to S2.
      • True moment of switch-over: When S2 hears about View 2.
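The rules above can be condensed into one request-handling function, from a single server’s point of view (a hypothetical sketch; "role" is what this server currently believes it is, which may lag what the view server knows):

```python
# Illustrative sketch of the partition-handling rules.
def handle(role, request_source, backup_acked):
    # Rule: a non-primary rejects requests that come directly
    # from coordinators.
    if request_source == "coordinator" and role != "primary":
        return "REJECT"
    # Rule: a primary rejects forwarded requests (it won't take
    # an update from the backup).
    if request_source == "forwarded" and role == "primary":
        return "REJECT"
    # Rule: a primary only ACKs once the backup has accepted.
    if role == "primary":
        return "ACK" if backup_acked else "ERROR"
    return "ACK"    # a backup accepting a forwarded request

# S1 is partitioned from the VS and still thinks it's primary; the real
# primary, S2, rejects S1's forwarded update, so S1 returns an error and
# the coordinator re-asks the VS for the current view.
assert handle("primary", "forwarded", backup_acked=False) == "REJECT"
assert handle("backup", "coordinator", backup_acked=False) == "REJECT"
assert handle("primary", "coordinator", backup_acked=True) == "ACK"
```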
  7. Recruiting New Backups
    • When primary fails and backup is promoted, VS will eventually—hopefully quickly—find a new (idle) server to be backup.
    • New backup copies state over from primary and is then ready to go.
    • If primary fails during that copy?
      • Don’t promote backup to primary; has incomplete state.
      • Keep failed primary as primary; perhaps the failure is a network issue and it’ll come back up.
    • Moral of the story: Having multiple backups is not a bad idea.
  8. Dealing with Centralization
    • As described, view server is a central point of failure, which seems terrible. Also seems like it could be the bottleneck of the system.
    • Easy to distribute the view server: View servers 1 through N, each responsible for a different partition of replica sets.
    • Distribution != replication, though.
    • Can we replicate the view server in the same way?  Nope; we’d need a view server for the view servers.
    • We need a mechanism for “distributed consensus.”
      • Raft: See in recitation tomorrow.
      • Paxos: The original distributed consensus mechanism.

Minor note: In some texts, you will see the word “client” used when describing RSMs where I used “coordinator”. I am using “coordinator” because we’re coming from the world of 2PC: Clients (users) send requests to the coordinators, and those coordinators deal with the internals of the system (contacting appropriate servers, etc.). I want you to understand how this whole picture fits together.

But a lot of times we use “client” as a generic term in the client/server setup. So in reading about RSMs, when you see “client”, it’s just the part of the system sending requests to the servers for data. In the context of 2PC, the client in an RSM is the coordinator from 2PC.

Disclaimer: This is part of the security section in 6.033. Only use the information you learn in this portion of the class to secure your own systems, not to attack others.

  1. Introduction 
    • Previously in 6.033: Building reliable systems in the face of more-or-less random, more-or-less independent failures.
    • Today: Building systems that uphold some goals in the face of targeted attacks from an adversary.
    • What can an adversary do?
      • Personal information stolen
      • Phishing attacks
      • Botnets
      • Worms / viruses
      • Etc.
  2. Computer Security vs. General Security 
    • Similarities:
      • Compartmentalization (different keys for different things).
      • Log information, audit to detect compromises.
      • Use legal system for deterrence.
    • Differences:
      • Internet = fast, cheap, scalable attacks.
      • Number and type of adversaries is huge.
      • Adversaries are often anonymous.
      • Adversaries have a lot of resources (botnets).
      • Attacks can be automated.
      • Users have poor intuition about computer security.
  3. Difficulties of Computer Security 
    • Aside from everything above…
    • It’s difficult to enumerate all threats facing computers.
    • Achieving something despite whatever an adversary might do is a negative goal.
      • Contrast: An example of a positive goal is “Katrina can read grades.txt”. Can easily check to see if the goal is met.
      • Example of a negative goal: “Katrina cannot read grades.txt”. Not enough to just ask Katrina if she can read grades.txt and have her respond “no.”
        • Hard to reason about all possible ways she could get access: Change permissions, read backup copy, intercept network packets…
    • One failure due to an attack might be one too many (e.g., disclosing grades.txt even once).
    • Failures due to an attack can be highly correlated; difficult to reason about failure probabilities.
    • As a result: We have no complete solution. We’ll learn how to model systems in the context of security and how to assess common risks/combat common attacks.
  4. Modeling Security 
    • Need two things:
      1. Our goals, or our “policy.”
        • Common ones:
          • Privacy: Limit who can read data.
          • Integrity: Limit who can write data.
          • Availability: Ensure that a service keeps operating.
      2. Our assumptions, or our “threat model.”
        • What are we protecting against? Need plausible assumptions.
        • Examples:
          • Assume that the adversary controls some computers or networks but not all of them.
          • Assume that the adversary controls some software on computers, but doesn’t fully control those machines.
          • Assume that the adversary knows some information, such as passwords or encryption keys, but not all of them.
    • Many systems are compromised due to incomplete threat models or unrealistic threat models.
      • E.g., assume the adversary is outside of the company network/firewall when they’re not. Or don’t assume that the adversary can do social engineering.
    • Try not to be overambitious with our threat models; makes modularity hard.
    • Instead: Be very precise, and then reason about assumptions and solutions. Easier to evolve threat model over time.
  5. Guard Model 
    • Back to client/server model.
    • Usually, client is making a request to access some resource on the server. So we’re worried about security at the server.
    • To attempt to secure this resource, server needs to check all accesses to the resource. “Complete mediation.”
    • Server will put a “guard” in place to mediate every request for this particular resource. Only way to access the resource is to use the guard.
    • Guard often provides:
      • Authentication: Verify the identity of the principal. E.g., checking the client’s username and password.
      • Authorization: Verify whether the principal has access to perform its request on the resource. E.g., by consulting an access control list for a resource.
    • Guard model applies lots of places, not just client/server.
    • Uses a few assumptions:
      • Adversary should not be able to access the server’s resources directly.
      • Server properly invokes the guard in all the right places.
      • (We’ll talk about what happens if these are violated later.)
    • Guard model makes it easier to reason about security.
    • Examples:
      1. UNIX file system
        • Client: A process.
        • Server: OS kernel.
        • Resource: Files, directories.
        • Client’s requests: Read(), write() system calls.
        • Mediation: U/K bit and the system call implementation.
        • Principal: User ID.
        • Authentication: Kernel keeps track of a user ID for each process.
        • Authorization: Permission bits & owner UID in each file’s inode.
      2. Web server running on UNIX
        • Client: HTTP-speaking computer.
        • Server: Web application (let’s say it’s written in Python).
        • Resource: Wiki pages (say).
        • Requests: Read/write wiki pages.
        • Mediation: Server stores data on local disk, accepts only HTTP requests (this requires setting file permissions, etc., and assumes the OS kernel provides complete mediation).
        • Principal: Username.
        • Authentication: Password.
        • Authorization: List of usernames that can read/write each wiki page.
      3. Firewall. (A firewall is a system that acts as a barrier between a, presumably secure, internal network and the outside world. It keeps untrusted computers from accessing the network.)
        • Client: Any computer sending packets.
        • Server: The entire internal network.
        • Resource: Internal servers.
        • Requests: Packets.
        • Mediation:
          • Internal network must not be connected to internet in other ways.
          • No open wifi access points on internal network for adversary to use.
          • No internal computers that might be under control of adversary.
        • Principal, authentication: None.
        • Authorization: Check for IP address & port in table of allowed connections.
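The guard’s two checks can be sketched in a few lines (an illustrative example only; the usernames, passwords, and ACL entries are made up, and a real guard would store salted password hashes rather than plaintext):

```python
# Illustrative sketch of the guard model: every request for the resource
# goes through one guard, which authenticates and then authorizes.
PASSWORDS = {"alice": "correct horse", "bob": "hunter2"}
ACL = {"wiki/home": {"read": {"alice", "bob"}, "write": {"alice"}}}

def guard(username, password, operation, resource):
    # Authentication: verify the identity of the principal.
    if PASSWORDS.get(username) != password:
        return "DENY: bad credentials"
    # Authorization: may this principal perform this operation
    # on this resource? Consult the access control list.
    allowed = ACL.get(resource, {}).get(operation, set())
    if username not in allowed:
        return "DENY: not authorized"
    return "ALLOW"

assert guard("alice", "correct horse", "write", "wiki/home") == "ALLOW"
assert guard("bob", "hunter2", "write", "wiki/home") == "DENY: not authorized"
assert guard("eve", "guess", "read", "wiki/home") == "DENY: bad credentials"
```

Complete mediation means there is no code path to the wiki pages that bypasses this function.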
  6. What Can Go Wrong? 
    1. Complete mediation is bypassed by software bugs.
    2. Complete mediation is bypassed by an adversary.
      • How do we prevent these things? Reduce complexity: Reduce the number of components that must invoke the guard.
      • In security terminology, this is the “principle of least privilege”. Privileged components are “trusted”. We limit the number of trusted components in our systems, because if one breaks, it’s bad.
    3. Policy vs. mechanism. High-level policy is (ideally) concise and clear. Security mechanisms (e.g., guards) often provide lower-level guarantees.
    4. Interactions between layers, components. Consider this code: 
      > cd /mit/bob/project
      > cat ideas.txt
      Hello world.
      ...
      > mail alice@mit.edu < ideas.txt
      Seems fine. But suppose in between us cat’ing ideas.txt and mailing Alice, Bob changes ideas.txt to a symlink to grades.txt.
    5. Users make mistakes.
    6. Cost of security. Users may be unwilling to pay cost (e.g., inconvenience) of security measures. Cost of security mechanism should be commensurate with value.
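The symlink race in example 4 is a check-then-use problem: the path is resolved once for the check and again for the use. A common defense (sketched here in Python for POSIX systems; the function name is made up) is to open the file once, refusing to follow a symlink, and then do all checks and reads on the resulting file descriptor, which cannot be swapped out from under us.

```python
# Illustrative sketch: avoid re-resolving the path between check and use.
import os
import stat

def read_no_symlink(path):
    # O_NOFOLLOW makes open() fail if the final path component is a
    # symlink, so Bob can't swap ideas.txt for a link to grades.txt
    # between our check and our read.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        # Check the already-open descriptor, not the path.
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise PermissionError("not a regular file")
        return os.read(fd, 4096)
    finally:
        os.close(fd)
```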

For this recitation, you’ll be reading “In Search of an Understandable Consensus Algorithm (PDF)” by Diego Ongaro and John Ousterhout. This paper describes Raft, an algorithm for achieving distributed consensus. The paper contrasts Raft to an algorithm called Paxos: You do not need to know anything about Paxos to read this paper. Raft was designed to be more understandable than Paxos.

Before reading the paper, check out two very helpful websites, which have some useful visualizations:

With those visualizations in mind, read the paper. Skip sections 5.4.3 and 7, and skim sections 9.1 and 9.2.

  • The first four sections give background and motivation for Raft. Sections five and six are the primary technical sections.
  • Fig. 2 is a good reference to come back to after you’ve read the paper. Don’t get stuck trying to memorize the entire table before you move on to page 5 of the paper; skip it, and come back to it during your reading or at the end.

To check your understanding after reading:

  • How does Raft handle the following three types of failures?: Leader failures, candidate or follower failures, and network partitions (a network partition means that one part of the cluster is unable to communicate with any machine in the other part).
  • Take a look at figure 7. For each log, (a)-(f), what sequence of events might have led to that log?

Questions for Recitation

Before you come to this recitation, write up (on paper) a brief answer to the following (really—we don’t need more than a couple sentences for each question). 

Your answers to these questions should be in your own words, not direct quotations from the paper.

  • In Raft, what is the leader’s function?
  • How does the leader work?
  • Why does Raft need a leader?

As always, there are multiple correct answers for each of these questions.
