6.033 | Spring 2018 | Undergraduate

Computer System Engineering

Week 4: Operating Systems Part IV

Lecture 6: Operating Systems Structure + Virtual Machines

Lecture 6 Outline

  1. Virtual Machines
  2. Virtual Machine Monitor (VMM) Implementation
  3. Virtualizing Memory
  4. Virtualizing U/K Bit
  5. Monolithic Kernels
  6. Microkernels: Alternative to Monolithic Kernels
  7. Summary

Lecture Slides

Reading

  • Book section 5.8

Recitation 6: Eraser

Lecture 7: Performance

Lecture 7 Outline

  1. Previously
  2. What’s Left? Performance
  3. Technique 1: Buy New Hardware
  4. General Approach
  5. Measurement
  6. How to Relax the Bottleneck
  7. Disk Throughput
  8. Caching
  9. Concurrency/Scheduling
  10. Parallelism
  11. Summary
  12. Useful Numbers for Your Day-to-Day Lives

Lecture Slides

Reading

Recitation 7: MapReduce

Hands-on Assignment 3: MapReduce

(Not available to OCW users.)

Tutorial 4: Writing the Critiques & Introduction to Collaboration


System Critique Assignment 2: MapReduce

Overview of Critique Assignments

As you know, in 6.033 you’ll complete a series of three critiques, designed to build analytical and communication skills. The first critique required you to understand and assess a distributed system based on an analytical framework presented in the critique worksheet. In this critique, you’ll use your answers to a similar worksheet to write a 2–3 page critique of the system. (You will receive feedback on Critique 1 well before this assignment is due.) The third critique will take the form of a peer review of another team’s Design Project Report. You’ll receive instruction about each of these critiques as the semester progresses.

This Critique

For this critique, you should evaluate and assess MapReduce using the critique worksheet (PDF). This worksheet walks you through the process of analyzing the system described in lecture and in the textbook.

After filling in the worksheet, use the following guidelines to organize your analysis into a 2–3 page critique of MapReduce. Your audience is a community of peers who have not read the paper.

Before beginning to write, think about how you might organize your critique. You might assess each module in turn, or each design goal. You might order design goals by importance (e.g., most to least important) or by strength (strongest to weakest, or the reverse!). Use your introduction to forecast and your conclusion to recap.

The critique should be approximately 750–900 words. You should familiarize yourself with the rubric (PDF) to understand the specific elements we will use to evaluate your writing (your Teaching Assistant will also grade this assignment for technical content).

  • Introduction explains the goals and components of the system before making a claim about how it prioritizes properties based upon its use cases.
  • Body of the paper presents your analysis of the system by explaining how the system prioritizes properties within its design and the limits of the system and its modules.
  • Conclusion summarizes the system and its key advantages and disadvantages.

Preparation for Eraser recitation

Before reading the Eraser paper, refresh your memory on what race conditions are and the trouble they can cause by revisiting sections 5.2.2, 5.2.3, and 5.2.4 of the textbook.

Then, read “Eraser: A Dynamic Data Race Detector for Multithreaded Programs” (PDF) by S. Savage, M. Burrows, G. Nelson, P. Sobalvarro & T. Anderson.

To help you as you read:

  • After Section 2, you should understand the lockset algorithm. For instance, you should know under what condition Eraser signals a data race, and why that condition was chosen.
  • After Section 3, you should understand Eraser’s implementation details. For instance, you should know under what conditions it reports false positives.
  • Section 4 details the authors’ evaluation of and experience with Eraser. This section is useful to convince yourself that Eraser is (or isn’t!) useful, that it performs (or doesn’t perform) well, etc.

As you read, think about the following:

  • Why can’t the lockset algorithm catch every race condition?
  • Would you use Eraser? If so, in what situations?

Question for Recitation

Before you come to this recitation, write up (on paper) a brief answer to the following (really—we don’t need more than a couple sentences for each question).  

Your answers to these questions should be in your own words, not direct quotations from the paper.

  • What are the goals of Eraser?
  • How was it designed to meet those goals?
  • Why do we need a tool like Eraser? (Or why do the authors believe that we need such a tool?)

As always, there are multiple correct answers for each of these questions.

  1. Virtual Machines
    • How to run multiple OSes on one machine?
    • Constraint: Compatibility. Don’t want to change existing kernel code.
    • We’ll run multiple virtual machines (VMs) on a single CPU. Kernel equivalent is the “virtual machine monitor” (VMM).
    • Can run VMM as user-mode app inside host OS, or run VMM on hardware in kernel mode with guest OSes in user mode. We’ll talk about the second, but the issues are the same.
    • Role of VMM:
      • Allocate resources.
      • Dispatch events.
      • Deal with instructions from guest OS that require interaction with the physical hardware.
    • Attempt 1: Emulate every single instruction.
      • Problem: Slow.
    • Attempt 2: Guest OSes run instructions directly on CPU.
      • Problem: Dealing with privileged instructions (the guest can’t run in kernel mode; then we’d be back to our original problem).
    • VMM will deal with handling privileged instructions.
  2. Virtual Machine Monitor (VMM) Implementation
    • Trap and emulate
      • Guest OS in user mode.
      • Privileged instructions cause an exception; VMM intercepts these and emulates.
      • If VMM can’t emulate, send exception back up to guest OS.
    • Problems:
      • How to do the emulation.
      • How to deal with instructions that don’t trigger an interrupt but that the VMM still needs to intercept.
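    A minimal sketch of the trap-and-emulate dispatch, written as runnable Python pseudocode. Everything here (the instruction names, the operands, the VirtualCPU state) is invented for illustration; real VMMs dispatch on hardware trap codes:

      # Hypothetical trap-and-emulate dispatch; instruction names, operands,
      # and the virtual CPU state are invented for illustration.

      class VirtualCPU:
          def __init__(self):
              self.mode = "user"            # the guest's *virtual* U/K bit
              self.page_table_root = None   # the guest's *virtual* PTR

      def deliver_exception_to_guest(vcpu, instruction):
          # VMM can't emulate this one: reflect the exception to the guest OS,
          # which handles it as if it came straight from the hardware.
          print(f"guest OS handles exception for {instruction!r}")

      def handle_trap(vcpu, instruction, operand=None):
          """Runs in the VMM when the guest (in user mode) hits a privileged instruction."""
          if instruction == "load_ptr":       # guest loads its page-table register
              vcpu.page_table_root = operand  # emulate against virtual state only
          elif instruction == "set_mode":     # guest flips its (virtual) U/K bit
              vcpu.mode = operand
          else:
              deliver_exception_to_guest(vcpu, instruction)

      vcpu = VirtualCPU()
      handle_trap(vcpu, "load_ptr", 0x4000)   # emulated silently by the VMM
      handle_trap(vcpu, "page_fault")         # reflected back up to the guest OS
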
  3. Virtualizing Memory
    • VMM needs to translate guest OS addresses into physical memory addresses. Three layers: Guest virtual, guest physical, host physical.
    • Approach 1: Shadow pages
      • Guest OS loads its page-table register (PTR); this privileged instruction causes an interrupt, which the VMM intercepts.
      • VMM locates guest OS’s page table. Combines guest OS’s table with its own table, constructing a third table mapping guest virtual to host physical.
      • VMM loads host physical addr of this new page table into the hardware PTR.
      • If guest OS modifies its page table, no interrupt thrown. To force an interrupt, VMM marks guest OS’s page table as read-only memory.
    • Approach 2
      • Modern hardware has support for virtualization.
      • Physical hardware (effectively) knows about both levels of tables: Will do lookup in the guest OS’s page table and then the VMM’s page table.
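    A toy illustration of the shadow-table composition from Approach 1. Page tables are modeled as plain Python dicts and all addresses are invented; real page tables are multi-level trees:

      # Toy shadow-page-table construction (Approach 1). Page tables here are
      # plain dicts and addresses are invented; real tables are multi-level trees.

      def build_shadow_table(guest_pt, vmm_pt):
          # Compose guest virtual -> guest physical with guest physical ->
          # host physical; the result (guest virtual -> host physical) is
          # what the VMM loads into the hardware PTR.
          return {gva: vmm_pt[gpa] for gva, gpa in guest_pt.items() if gpa in vmm_pt}

      guest_pt = {0x1000: 0x4000, 0x2000: 0x5000}   # maintained by the guest OS
      vmm_pt   = {0x4000: 0x9000, 0x5000: 0xA000}   # maintained by the VMM
      print(build_shadow_table(guest_pt, vmm_pt))   # {4096: 36864, 8192: 40960}

      # The VMM marks guest_pt's memory read-only, so any guest update traps
      # and the shadow table can be rebuilt before the guest runs again.
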
  4. Virtualizing U/K Bit
    • Problem with basic trap-and-emulate: Some instructions involve the U/K bit but don’t cause an exception (e.g., reading the U/K bit, or setting it to U).
    • Few solutions:
      • Para-virtualization: Modify guest OS. Hard to do, and goes against our compatibility goal.
      • Binary translation: VMM analyzes code from guest OS and replaces problematic instructions.
      • Hardware support: Some architectures have virtualization support built in. Have special VMM operating mode in addition to the U/K bit.
    • Hardware support is arguably the best. Makes VMM’s job easier.
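    A cartoon of binary translation, assuming a made-up instruction stream: the VMM rewrites the problematic (non-trapping) instructions into explicit traps before letting the guest’s code run:

      # Cartoon of binary translation with a made-up instruction stream: before
      # running a guest code block, rewrite the instructions that touch the U/K
      # bit without trapping into explicit traps the VMM will catch.

      PROBLEMATIC = {"read_mode", "set_mode_user"}   # silent in user mode: no trap

      def translate(block):
          return [("trap_to_vmm",) + insn if insn[0] in PROBLEMATIC else insn
                  for insn in block]

      guest_code = [("add", "r1", "r2"), ("read_mode", "r3"), ("load", "r4")]
      print(translate(guest_code))
      # [('add', 'r1', 'r2'), ('trap_to_vmm', 'read_mode', 'r3'), ('load', 'r4')]
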
  5. Monolithic Kernels
    • VMs protect OSes from each other’s faults, protect physical machine from OS faults. Why so many bugs, though?
    • The Linux kernel is, effectively, one large C program. Careful software engineering, but very little modularity within the kernel itself.
    • Bugs come about because of its complexity.
    • Kernel bugs = entire system failure (recall the in-class demo).
    • Even worse: Adversary can exploit these bugs.
  6. Microkernels: Alternative to Monolithic Kernels
    • Put subsystems—file servers, device drivers, etc.—in user programs. More modular.
    • There will still be bugs but:
      • Fewer, because of decreased complexity.
      • A single bug is less likely to crash the entire system.
    • Why isn’t Linux a microkernel, then?
      • High communication cost between modules.
      • Not clear that moving programs to userspace is worth it.
      • Hard to balance dependencies (e.g., sharing memory across modules).
      • Redesign is tough!
      • Spend a year of developer time rewriting the kernel or adding new features?
      • Microkernels can make it more difficult to change interfaces.
    • Some parts of Linux do have microkernel design aspects.
  7. Summary
    • Cool things we do with VMs: Run different OSes on a single machine, move VMs from one physical machine to another.
    • Microkernels and VMs solve orthogonal problems.
      • Microkernels: Split up monolithic designs.
      • VMs: Let us run many instances of an existing OS. They are, in some sense, a partial solution to monolithic kernels (at least we can run these kernels safely). But their goal is to run multiple OSes on a single piece of hardware, not to target monolithic OSes specifically.
    • VMs most commonly implemented with hardware support (a special VMM mode in addition to U/K bit).
  1. Previously 
    • Enforced modularity on a single machine via virtualization.
      • Virtual memory, bounded buffers, threads.
    • Saw monolithic vs. microkernels.
    • Talked about VMs as a means to run multiple instances of an OS on a single machine with enforced modularity (bug in one OS won’t crash the others).
      • Big thing to solve was how to implement the VMM. Solution: Trap and emulate. How the emulation works depends on the situation.
        • Another key problem: How to trap instructions that don’t generate interrupts.
  2. What’s left? Performance 
    • Performance requirements significantly influence a system’s design.
    • Today: General techniques for improving performance.
  3. Technique 1: Buy New Hardware 
    • Why? Moore’s law => processing power doubles every 1.5 years, DRAM density increases over time, disk price (per GB) decreases, …
    • But:
      • Not all aspects improve at the same pace.
      • Moore’s Law is plateauing.
      • Hardware improvements don’t always keep pace with load increases.
    • Conclusion: Need to design for performance, potentially re-design as load increases.
  4. General Approach 
    • Measure the system and find the bottleneck (the portion that limits performance).
    • Relax (improve) the bottleneck.
  5. Measurement 
    • To measure, need metrics:
      • Throughput: Number of requests over a unit of time.
      • Latency: Amount of time for a single request.
      • The relationship between the two changes depending on the context.
      • As the system becomes heavily loaded:
        • Latency and throughput start low. Throughput increases as users enter, latency stays flat…
        • …until the system is at maximum throughput. Then throughput plateaus and latency increases.
      • For heavily-loaded systems: Focus on improving throughput.
    • Need to compare measured throughput to possible throughput: Utilization.
    • Utilization sometimes makes bottleneck obvious (CPU is 100% utilized vs. disk is 20% utilized), sometimes not (CPU and disk are 50% utilized, and at alternating times).
    • Helpful to have a model in place: What do we expect from each component?
    • When bottleneck is not obvious, use measurements to locate candidates for bottlenecks, fix them, see what happens (iterate).
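    As a toy example of using utilization this way (Python, with invented numbers; in practice they come from measurement tools):

      # Toy bottleneck hunt from per-component utilization (numbers invented;
      # in practice they come from measurement). One component near 100% is the
      # likely bottleneck; several at ~50% means keep measuring and iterating.

      utilization = {"cpu": 1.00, "disk": 0.20, "network": 0.10}

      candidate = max(utilization, key=utilization.get)
      print(f"candidate bottleneck: {candidate} at {utilization[candidate]:.0%}")
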
  6. How to Relax the Bottleneck 
    • Better algorithms, etc. These are application-specific. 6.033 focuses on generally-applicable techniques.
    • Batching, caching, concurrency, scheduling.
    • Examples of these techniques follow. The examples relate to operating systems (that’s what you know), but the techniques apply to all systems.
  7. Disk Throughput 
    • How does an HDD (magnetic disk) work? 
      • Several platters on a rotating axle.
      • Platters have circular tracks on either side, divided into sectors.
        • Cylinder: Group of aligned tracks.
      • Disk arm has one head for each surface, all move together.
      • Each disk head reads/writes sectors as they rotate past. Size of a sector = unit of read/write operation (typically 512B).
      • To read/write:
        • Seek arm to desired track.
        • Wait for platter to rotate the desired sector under the head.
        • Read/write as the platter rotates.
    • What about SSDs? 
      • Organized into cells, each of which holds one (or two, or three) bits.
      • Cells organized into pages; pages into blocks.
      • Reads happen at page-level. Writes also at page-level, but to new pages (no overwrites of pages).
      • Erases (and thus overwrites) are at block-level.
        • Takes a high voltage to erase.
    • How long does R/W take on HDD? 
      • Example disk specs:
        • Capacity: 400GB
        • Platters: 5
        • # heads: 10
        • # sectors per track: 567–1170 (inner to outer)
        • # bytes per sector: 512
        • Rotational speed: 7200 RPM => 8.3ms per revolution
      • Seek time: Avg read seek 8.2ms, avg write seek 9.2ms.
        • Given as part of disk specs
      • Rotation time: 0–8.3ms.
        • Platters only rotate in one direction.
      • R/W as platter rotates: 35–62MB/sec.
        • Also given in disk specs.
      • So reading a random 4KB block: 8.2ms + 4.1ms + ~0.1ms = 12.4ms.
      • 4096 B / 12.4 ms = 322KB/s. 
        => 99% of the time is spent moving the disk.
    • Can we do better? 
      • Use flash? For this random-access read workload, yes; SSDs would help if available.
      • Batch individual transfers?
        • .8ms to seek to next track + 8.3ms to read entire track = 9.1ms.
          • .8ms is single-track seek time for our disk (again, from specs).
        • 1 track contains ~1000 sectors * 512B = 512KB.
        • Throughput: 512KB/9.1ms = 55MB/s. (Both computations are redone in the sketch below.)
    • Lesson: Avoid random access. Try to do long sequential reads. 
      • But how?
        • If your system reads/writes entire big files, lay them out contiguously on disk. Hard to achieve in practice!
        • If your system reads lots of small pieces of data, group them.
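    The two throughput computations above, redone in a few lines of Python with the example disk’s spec numbers:

      # Redo the two throughput computations with the example disk's specs.
      seek_ms          = 8.2   # average read seek (from the spec sheet)
      rotation_ms      = 8.3   # one revolution at 7200 RPM
      half_rotation_ms = 4.1   # expected rotational delay (rotation_ms / 2, rounded)
      transfer_ms      = 0.1   # ~4KB at the disk's sequential transfer rate

      # Random 4KB read: full seek + half a rotation on average + transfer.
      random_ms = seek_ms + half_rotation_ms + transfer_ms         # 12.4ms
      print(f"{4096 / (random_ms / 1000) / 1024:.1f} KB/s")        # ~322 KB/s

      # Batched: one single-track seek, then a whole ~512KB track per rotation.
      track_ms = 0.8 + rotation_ms                                 # 9.1ms
      print(f"{512 / (track_ms / 1000) / 1024:.1f} MB/s")          # ~55 MB/s
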
  8. Caching 
    • Already saw in DNS. Common performance-enhancement for systems.
    • How do we measure how well it works?
      • Average access time = hit_time * hit_rate + miss_time * miss_rate.
    • Want high hit rate. How do we know what to put in the cache?
      • Can’t keep everything.
      • So really: How do we know what to *evict* from the cache?
    • Popular eviction policy: Least-recently used.
      • Evict data that was used the least recently.
      • Works well for popular data.
      • Bad for sequential access (think: Sequentially accessing a dataset that is larger than the cache).
    • Caching is good when:
      • All data fits in the cache.
      • There is locality, temporal or spatial.
    • Caching is bad for:
      • Writes (a write must eventually reach the disk, since the disk holds the non-volatile copy; keeping the cache consistent with the disk adds work).
    • Moral: To build a good cache, need to understand access patterns.
      • Like disk performance: To relax disk as bottleneck, needed to understand details of how it works.
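    A quick sanity check of the average-access-time formula with invented numbers, plus a minimal LRU cache built on Python’s OrderedDict (illustrative, not a real buffer cache):

      from collections import OrderedDict

      # Sanity-check the formula with invented numbers: a 0.1ms cache, a 10ms
      # disk, and a 90% hit rate give roughly a 1ms average access time.
      hit_time, miss_time, hit_rate = 0.1, 10.0, 0.9   # times in ms
      print(f"{hit_time * hit_rate + miss_time * (1 - hit_rate):.2f} ms")  # 1.09 ms

      # Minimal LRU cache: evict whatever was used least recently.
      class LRUCache:
          def __init__(self, capacity):
              self.capacity = capacity
              self.data = OrderedDict()

          def get(self, key):
              if key not in self.data:
                  return None                    # miss: caller fetches from disk
              self.data.move_to_end(key)         # mark as most recently used
              return self.data[key]

          def put(self, key, value):
              self.data[key] = value
              self.data.move_to_end(key)
              if len(self.data) > self.capacity:
                  self.data.popitem(last=False)  # evict the LRU entry
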
  9. Concurrency/Scheduling 
    • Suppose server alternates between CPU and disk:
      CPU:  --A--      --B--      --C--
      Disk:       --A--      --B--      --C--
    • Apply concurrency, can get:
      CPU:  --A----B----C-- …
      Disk:       --A----B-- …
    • This is a scheduling problem: Different orders of execution can lead to different performance.
    • Example: 
      • 5 concurrent threads issue concurrent reads to sectors 71, 10, 92, 45, and 29.
      • Naive algorithm: Seek to each sector in turn.
      • Better algorithm: Sort by track and perform reads in order. Gets even higher throughput as load increases.
        • Drawback: It’s unfair.
    • No one right answer to scheduling. Tradeoff between performance and fairness.
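    A sketch of the example in Python, comparing total arm travel (in sectors, as a stand-in for seek cost) for arrival order vs. sorted order; the starting head position is invented:

      # The example's five pending reads, served in arrival order vs. sorted
      # order. Sector numbers stand in for arm position; the starting head
      # position (0) is invented.

      def total_seek(order, head=0):
          dist = 0
          for sector in order:
              dist += abs(sector - head)
              head = sector
          return dist

      pending = [71, 10, 92, 45, 29]
      print(total_seek(pending))          # arrival order: 277
      print(total_seek(sorted(pending)))  # sorted order:   92
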
  10. Parallelism 
    • Goal: Have multiple disks, want to access them in parallel.
    • Problem: How do we divide data across the disks? (See the sketch after this list.)
    • Depends on bottleneck:
      • Case 1: Many requests for many small files. Limited by disk seeks. Put each file on a single disk, and allow multiple disks to seek multiple records in parallel.
      • Case 2: Few large reads. Limited by sequential throughput. Stripe files across disks.
    • Another case: Parallelism across many computers.
      • Problem: How do we deal with machine failures?
      • (One) Solution: Go to recitation tomorrow!
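    A toy sketch of the two layouts in Python (disk count, hashing scheme, and block-to-disk mapping are all invented):

      # Toy versions of the two layouts; disk count, hashing, and the
      # block-to-disk mapping are all invented for illustration.
      N_DISKS = 4

      def disk_for_file(filename):
          # Case 1 (many small files): whole file on one disk, so independent
          # requests can seek on different disks in parallel.
          return hash(filename) % N_DISKS

      def disk_for_block(block_number):
          # Case 2 (few large reads): stripe consecutive blocks round-robin,
          # so one big sequential read streams from every disk at once.
          return block_number % N_DISKS

      print([disk_for_block(b) for b in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
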
  11. Summary 
    • We can’t magically apply any of the previous techniques. Have to understand what goes on underneath.
      • Batching: How disk access works.
      • Caching: What is the access pattern?
      • Scheduling/concurrency: How disk access works, how system is being used (the workload).
      • Parallelism: What is the workload?
    • Techniques apply to multiple types of hardware.
      • E.g., caching is useful regardless of whether you have HDD or SSD.
  12. Useful numbers for your day-to-day lives:
    • Latency:
      • 0.000001ms: Instruction time (1 ns)
      • 0.0001ms: DRAM load (100 ns)
      • 0.1ms: LAN network
      • 10ms: Random disk I/O
      • 25–50ms: Internet east -> west coast
    • Throughput:
      • 10,000 MB/s: DRAM
      • 1,000 MB/s: LAN (or 100 MB/s)
      • 100 MB/s: Sequential disk (or 500 MB/s)
      • 1 MB/s: Random disk I/O

Preparation for MapReduce recitation

  • Read “MapReduce” (PDF) by J. Dean & S. Ghemawat.
  • Skip Sections 4 and 7.

This paper was published at the biennial USENIX Symposium on Operating Systems Design and Implementation (OSDI) in 2004, one of the premier conferences in computer systems. (OSDI alternates with the equally prestigious ACM Symposium on Operating Systems Principles (SOSP), where Eraser, the paper you read for a previous recitation, appeared.)

After reading through Section 3, you should be able to understand and explain Figure 1 (the “Execution overview”). After reading Sections 5 and 6, you should understand the real-world performance of MapReduce. An example question that you should be able to answer: How do stragglers affect performance?

As you read, think about the following:

  • MapReduce has a constrained programming model. Are the benefits of using MapReduce worth that constraint?
  • What types of failures does MapReduce handle, and how does it handle them?

Question for Recitation

Before you come to this recitation, write up (on paper) a brief answer to the following (really—we don’t need more than a couple sentences for each question). 

Your answers to these questions should be in your own words, not direct quotations from the paper.

  • What are the performance goals of MapReduce (both the programming model + its implementation)?
  • How was MapReduce implemented at Google to meet those goals?
  • Why was MapReduce implemented in this way?

As always, there are multiple correct answers for each of these questions.
