6.830 | Fall 2010 | Graduate, Undergraduate

Database Systems

Readings

Many of the readings are drawn from the following sources:

Hellerstein, Joseph, and Michael Stonebraker. Readings in Database Systems (The Red Book). 4th ed. MIT Press, 2005. ISBN: 9780262693141.

Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. 3rd ed. McGraw-Hill, 2002. ISBN: 9780072465631.

LEC # TOPICS
1 Introduction (CC)
2 The relational model (MS)
3 Schema design (MS)
4 Introduction to database internals (CC)
5 Database operators and query processing (CC)
6 Indexing and access methods (CC)
7 Buffer pool design and memory management (CC)
8 Join algorithms (CC)
9 Query optimization (CC)
10 Transactions and locking (MS)
11 Optimistic concurrency control (MS)
12 Recovery (MS)
13 Recovery (MS) (cont.)
14 Degrees of consistency (MS)
15 C-Store (Guest lecture: Sam Madden, MIT CSAIL)

16 Distributed transactions (CC)
17 Parallel databases (CC)
18 Scientific databases (MS)
19 NOSQL (CC)
20 ORM, DryadLINQ (CC)
21 Streaming databases (MS)
22 Database as a service (CC)
23 Final project presentations

In this class, we will begin our discussion of concurrency control and recovery. To prepare for this class, please read:

Franklin, Michael. “Concurrency Control and Recovery.” The Computer Science and Engineering Handbook, 1997. (PDF)

As you read, note any terms you are unfamiliar with and come to class prepared to ask questions!

In this lecture, we will continue our discussion of concurrency control and study another approach for isolating transactions called “Optimistic Concurrency Control.” Read:

Kung, H. T., and John T. Robinson. “On Optimistic Methods for Concurrency Control.” ACM Transactions on Database Systems 6, no. 2 (1981): 213-226. In the Red Book.

As you read, note any terms you are unfamiliar with and come to class prepared to answer the following questions:

  • When would you expect that optimistic concurrency control would outperform locking-based concurrency control?
  • Can optimistic concurrency control result in deadlock?
  • How would you implement optimistic concurrency control in SimpleDB?
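As a concrete reference point for these questions, here is a minimal sketch of backward validation in the spirit of Kung and Robinson. All class and method names here are illustrative, not from SimpleDB or any real system; writes are buffered locally and applied only if validation succeeds.

```python
# Minimal sketch of optimistic concurrency control with backward
# validation (Kung & Robinson style). Illustrative names only.

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.read_set = set()
        self.write_set = set()
        self.local = {}          # deferred writes, applied only on commit

class OptimisticScheduler:
    def __init__(self):
        self.db = {}
        self.committed = []      # (commit_seq, write_set) of finished txns
        self.seq = 0

    def begin(self, txn):
        txn.start_seq = self.seq

    def read(self, txn, key):
        txn.read_set.add(key)
        return txn.local.get(key, self.db.get(key))

    def write(self, txn, key, value):
        txn.write_set.add(key)
        txn.local[key] = value

    def commit(self, txn):
        # Validation phase: did any transaction that committed after we
        # began write something we read? If so, our reads may be stale.
        for commit_seq, write_set in self.committed:
            if commit_seq > txn.start_seq and write_set & txn.read_set:
                return False     # validation failed; caller retries
        self.seq += 1
        self.committed.append((self.seq, set(txn.write_set)))
        self.db.update(txn.local)   # write phase
        return True
```

Note that validation never blocks waiting for another transaction, which bears on the deadlock question above (though repeated aborts can starve a transaction).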

In this class, we will begin discussing ARIES and recovery. The Franklin paper assigned for Lec #10 includes a good summary of the essentials of the ARIES algorithm.

  • Ramakrishnan and Gehrke. Database Management Systems. Read the section about crash recovery.
  • (Optional) Mohan, C., et al. “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.” ACM Transactions on Database Systems 17, no. 1 (1992): 94-162. In the Red Book. Read sections 1-7, and skim sections 12 and 13 (long and difficult).

In this lecture, we will continue discussing the ARIES recovery protocol that we began last week. No additional reading is assigned.

Read:

Gray, Jim, et al. “Granularity of Locking and Degrees of Consistency in a Shared Data Base.” From Modeling in Data Base Management Systems. Edited by G. M. Nijssen. 1976. In the Red Book.

This paper presents two distinct ideas. First, there is a “hierarchy of locking”: databases can lock pages, tables, tuples, or fields, and there is a tradeoff among these granularities. Second, there are different degrees of consistency that transactions can require, and those different degrees imply different locking disciplines with different performance characteristics.

As you read the paper, consider the following questions:

  1. What is the tradeoff between coarse granularity (e.g., table-level) and fine granularity (e.g., tuple-level) locks in a database system?
  2. How does hierarchical locking help address the phantom problem? (Hint: see page 374 and the discussion of index-interval locks.)
  3. What are the performance tradeoffs between the different degrees of locking?
  4. How does degree-0 locking ensure compatibility with all other locking modes?
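The lock-mode compatibility matrix at the heart of the paper can be written down directly. The sketch below encodes it in Python (the modes and matrix follow Gray et al.; the helper function is just for illustration):

```python
# Compatibility matrix for hierarchical (multi-granularity) locking,
# as in Gray et al.: IS and IX are "intention" modes taken on ancestors
# (database, table) before locking a descendant (page, tuple).
COMPAT = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,
    ("IS", "SIX"): True, ("IS", "X"): False,
    ("IX", "IX"): True,  ("IX", "S"): False,  ("IX", "SIX"): False,
    ("IX", "X"): False,
    ("S",  "S"): True,   ("S", "SIX"): False, ("S", "X"): False,
    ("SIX", "SIX"): False, ("SIX", "X"): False,
    ("X",  "X"): False,
}

def compatible(held, requested):
    # The matrix is symmetric, so check both orientations.
    return COMPAT.get((held, requested), COMPAT.get((requested, held), False))
```

An X lock on a table, for example, conflicts even with IS, so it blocks readers of every tuple in that table; that is the granularity tradeoff of question 1 in miniature.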

Read:

Stonebraker, Michael, et al. “C-Store: A Column-oriented DBMS.” Proceedings of the 31st VLDB Conference, 2005. (PDF)

This paper discusses the C-Store system, which is a read-optimized database system with a novel physical arrangement of data on disk, as well as an unusual approach to recovery and query processing.

As you read the paper, consider the following questions:

  1. Why does the C-Store idea of arranging data as columns improve query performance?
  2. How does C-Store propose to do recovery? Is there an advantage to this approach?
  3. How does C-Store isolate transactions? Do you think this is a good idea?
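To make the first question concrete, here is a toy illustration (all data made up) of why a query touching one attribute reads far less data from a column-oriented layout:

```python
# Row-oriented vs column-oriented layout, illustrated with made-up data.
# A query over one column out of many touches far fewer bytes in a
# column store, and the column's values compress well when contiguous.

rows = [(i, f"name{i}", i * 1.5) for i in range(1000)]  # (id, name, price)

# Row store: all fields of each tuple are stored together.
row_store = rows

# Column store: each attribute is stored contiguously.
col_store = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# SELECT sum(price):
total_row = sum(r[2] for r in row_store)   # scans every field of every row
total_col = sum(col_store["price"])        # scans one contiguous column
```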

We will continue our discussion of parallel and distributed query processing. You should read:

Mohan, C., B. Lindsay, and R. Obermarck. “Transaction Management in the R* Distributed Database Management System.” ACM Transactions on Database Systems 11, no. 4 (1986): 378-396. In the Red Book.

This paper discusses distributed transactions, addressing the problem of providing ACID-style semantics in a shared nothing environment.

As you read the paper, consider the following questions:

  1. In the “R*” paper, how does the two-phase commit (2PC) protocol work? What problem does it solve? What are the costs of using it?
  2. What is the significance of the Presumed Abort/Presumed Commit variants of 2PC? How do they reduce the overhead of 2PC? When should you choose one over the other?
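A toy sketch of the basic protocol may help anchor these questions (all names invented; the paper's presumed-abort and presumed-commit variants optimize the logging and messaging around this skeleton):

```python
# Toy two-phase commit, in the spirit of the R* paper. Participants
# vote in phase 1; the coordinator commits only if all vote yes, then
# broadcasts the decision in phase 2. Illustrative only: no logging,
# no failures, no timeouts.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "active"

    def prepare(self):
        # Phase 1: a real participant force-writes a prepare log record
        # before voting yes, after which it must await the decision.
        if self.will_vote_yes:
            self.state = "prepared"
            return "yes"
        self.state = "aborted"        # a no-voter may forget the txn
        return "no"

    def decide(self, decision):
        self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(v == "yes" for v in votes) else "aborted"
    for p in participants:
        if p.state == "prepared":     # only prepared sites need the decision
            p.decide(decision)
    return decision
```

The cost visible even in this sketch is two message rounds plus forced log writes per transaction, which is what the presumed-abort and presumed-commit variants try to reduce.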

Read:

DeWitt, David, and Jim Gray. “Parallel Database Systems: The Future of High Performance Database Processing.” Communications of the ACM 35, no. 6 (1992): 85-98. In the Red Book.

The DeWitt and Gray paper is a high-level summary of database architectures for parallelism, illustrating some of the techniques that can be used to exploit the availability of multiple processors in a database system.

Questions to consider:

  1. What’s the difference between a parallel and a distributed database? What issues are different in one architecture versus the other? In what ways are the two architectures alike?
  2. Why do DeWitt and Gray advocate a shared nothing architecture?
  3. In what ways must existing database architectures be modified to support multi-processor environments? What new data layout issues are introduced? What new query processing challenges must be addressed?

Read:

Stonebraker, Michael, et al. “Requirements for Science Data Bases and SciDB.” CIDR Perspectives (2009).

The paper is a high-level description of the requirements of a new generation DBMS (SciDB) that supports modern scientific projects.

Read:

Chang, Fay, et al. “Bigtable: A Distributed Storage System for Structured Data.” OSDI, 2006. (PDF)

Bigtable is a storage system for structured data; it is essentially a compressed, horizontally partitioned, column-oriented database with some interesting properties that allow it to reach “web scale”.

As you read the paper, consider the following questions:

  1. In what ways does Bigtable replicate the behavior of a relational database? How is it different?
  2. Could Google have built Bigtable using a relational system? Would that have been a good idea? Why or why not?
  3. What consistency guarantees does Bigtable provide in the face of failures?
  4. What is a distributed lock service?

In this lecture, we will continue our discussion of data models and database system architecture, looking in more detail at the relational model.

There is a lot of reading for this lecture. You should start early and try to digest it all, as it lays the foundation for much of what is to come. The papers are:

  • Stonebraker, Michael, and Joseph Hellerstein. “What Goes Around Comes Around.” In Readings in Database Systems (aka the Red Book), or online here (PDF). Read sections 1-4 (if you know something about XML, you may also enjoy reading sections 10 and 11).
  • Codd, E. F. “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM 13, no. 6 (1970): 377-387. (Focus on sections 1.3 and all of section 2.)

You may also find it useful to read pages 57-63 of Ramakrishnan and Gehrke for a brief overview of the relational model.

As you read these papers, think about and be prepared to answer the following questions in lecture:

  1. What is the notion of data independence? Why is it important?
  2. Codd spends a fair amount of time talking about “Normal forms”. Why is it important that a database be stored in a normal form?
  3. What are the key ideas behind the relational model? Why are they an improvement over what came before? In what ways is the relational model restrictive?
  4. What, according to Codd, are the most important differences between the “hierarchical” model (as exemplified by systems like IMS) and the relational model that Codd proposes? Make sure you understand what Codd means by “Data Dependencies”.

Read:

Yu, Yuan, et al. “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language.” OSDI, 2008. (PDF)

DryadLINQ is a programming language for manipulating structured data in a distributed setting. It provides a collection of SQL-like constructs that are well-integrated into C# (with a common type and object system), and compiles down to a graph of operators spread across a distributed network of machines in a way similar to how distributed databases work.
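A rough Python analogue of this language-integrated style is sketched below (operator names mimic LINQ's Where/Select/GroupBy; real DryadLINQ compiles such a pipeline into a distributed dataflow graph rather than executing it locally as this sketch does):

```python
# Rough, local-only analogue of language-integrated queries: SQL-like
# operators composed as ordinary method calls, type-checked by the
# host language. Illustrative only.

class Query:
    def __init__(self, data):
        self.data = list(data)

    def where(self, pred):
        return Query(x for x in self.data if pred(x))

    def select(self, fn):
        return Query(fn(x) for x in self.data)

    def group_by(self, key_fn):
        groups = {}
        for x in self.data:
            groups.setdefault(key_fn(x), []).append(x)
        return Query(groups.items())

    def to_list(self):
        return self.data

# Total order amount per customer, written as a query pipeline:
orders = [("alice", 30), ("bob", 15), ("alice", 20)]
totals = (Query(orders)
          .group_by(lambda o: o[0])
          .select(lambda kv: (kv[0], sum(amt for _, amt in kv[1])))
          .to_list())
```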

As you read the paper, consider the following questions:

  1. What are the advantages of a query language that is integrated into the programming language? Are there disadvantages?
  2. Dryad execution plans look a lot like database query plans, but are different in some ways — in particular, operators can have multiple outputs; what are the implications of this from a query execution perspective?

The reading for this lecture is:

  1. The start of chapter 2 to end of section 2.6 (pages 25-46) of Ramakrishnan and Gehrke
  2. Section 3.5 (pages 75-86) of Ramakrishnan and Gehrke
  3. The start of chapter 19 to end of section 19.4 (pages 605-619) of Ramakrishnan and Gehrke

The first two readings discuss ER modeling, one practical way to model a database and generate database schemas.

The third reading discusses a formal model based on the notion of functional dependencies that makes it possible to reason about whether schemas are free of anomalies that lead to operational problems in database system execution. You should focus on understanding BCNF and 3NF; we will not discuss the higher normal forms in much detail.

The relationship between these three readings is that ER modeling generally produces schemas that conform to 3NF/BCNF, though this isn’t strictly true (see section 19.7 for a discussion of cases when ER modeling doesn’t lead to BCNF or 3NF schemas.) For those who are interested, it turns out that there are relatively straightforward algorithms for generating BCNF/3NF schemas given a collection of functional dependencies — these are given in Sections 19.5 and 19.6 of Ramakrishnan and Gehrke, but you don’t need to know these in detail.
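To make the closure-based reasoning concrete, here is a small sketch following the definitions in chapter 19 (the representation of FDs as pairs of attribute sets is ours):

```python
# Reasoning with functional dependencies: compute attribute closures
# and use them to test whether a relation is in BCNF. A relation is in
# BCNF if the left side of every nontrivial FD is a superkey.

def closure(attrs, fds):
    """All attributes functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(relation, fds):
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue                       # trivial FD
        if not set(relation) <= closure(lhs, fds):
            return False                   # lhs is not a superkey
    return True

# R(A, B, C) with A -> B and B -> C: the FD B -> C violates BCNF,
# because the closure of {B} is only {B, C}, not all of R.
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
```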

As you read these chapters, think about and be prepared to answer the following questions in lecture:

  • What problems does schema normalization solve? Do you believe that these are important problems?
  • What is the distinction between BCNF and 3NF? Is there a reason to prefer one over the other?
  • Think about a data set you have worked with recently, and try to derive a set of functional dependencies that correspond to it. What assumptions did you have to make in modeling your data in this way?
  • How is it that ER modeling generally leads to BCNF/3NF schemas?

We will start discussing database system internals, based on the content of a paper:

Hellerstein, Joseph, Michael Stonebraker, and James Hamilton. “Architecture of a Database System.” Foundations and Trends in Databases 1, no. 2 (2007): 141-259. (PDF)

It is a rather long paper (don’t be too scared by the 119 pages; the page format makes it look much longer than it is) that is in general worth reading; however, we only require you to read sections 1, 2 (skim through it), 3, 4 (up to and including subsection 4.5), and 5. You can also skim through section 6, which we will discuss later on.

The purpose of these readings is to introduce the architecture of a database system at a high level. Our goal in lecture will be to tease apart the main components of most database systems. Once we’ve identified these components, we will discuss each over the next few weeks.

The paper assumes a certain degree of familiarity with database “lingo”, some of which will doubtless be unfamiliar to you. As you read, keep track of terms you do not know and come to class prepared to ask questions!

We will continue discussing database system internals, based on the same paper as the last lecture:

Hellerstein, Joseph, Michael Stonebraker, and James Hamilton. “Architecture of a Database System.” Foundations and Trends in Databases 1, no. 2 (2007): 141-259. (PDF)

We will be focusing on query processing, optimization, and memory/storage management. This is still an overview of the concepts that we will present in more detail in the following lectures.

This lecture will cover various issues related to the physical storage of relations on disk, as well as index data structures we might use to efficiently access those stored relations. Read the following (short) papers and book sections:

  • In Database Management Systems, read:
    • Pages 273-289. If you are using another book, this is the introduction to the section on Storage and Indexing which discusses different access methods and their relative performance.
    • Pages 344-358. This is an in-depth discussion of the B+Tree data structure and its implementation. Most database books, as well as any algorithms text (such as CLR or Knuth) will provide an equivalent discussion of B+Trees.
  • Beckmann, Norbert, et al. “The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles.” In the Red Book.

As you read, think about and come to class prepared to answer the following questions:

  • Under what circumstances is a sequential (in-order) scan of a heap file superior to a secondary index? Under what circumstances would the secondary index scan be preferable?
  • What is the purpose of the neighbor pointers in a B+Tree? When are they useful?
  • Why are B+Trees insufficient for storing and indexing the types of data stored by R*-Trees?

This lecture will cover buffer and memory management strategies in database systems.

Read the following paper:

Chou, Hong-Tai, and David DeWitt. “An Evaluation of Buffer Management Strategies for Relational Database Systems.” Proceedings of VLDB, 1985, pp. 127-141.

As you read, think about and come to class prepared to answer the following questions:

  • How much better is the hot-set model than simple LRU?
  • Under what circumstances is LRU a bad idea?
  • Why can these techniques be applied in database systems but not operating systems?
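To see why LRU can be a bad idea, the sketch below (names invented) demonstrates “sequential flooding”: a repeated scan of a file one page larger than the pool gets no hits at all under LRU. This is one of the access patterns Chou and DeWitt’s hot-set model is designed to handle:

```python
# Minimal LRU buffer pool, to make sequential flooding concrete.
# Illustrative only; real buffer managers track pins, dirty bits, etc.

from collections import OrderedDict

class LRUBufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> contents, in LRU order
        self.hits = self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # mark most-recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict least-recently used
            self.pages[page_id] = f"data-{page_id}"
        return self.pages[page_id]

pool = LRUBufferPool(capacity=10)
for _ in range(2):              # repeated scan of 11 pages, 10-page pool
    for page in range(11):
        pool.get(page)
# Every access on the second scan misses: LRU evicts exactly the page
# the scan is about to revisit.
```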

This lecture will cover join algorithms, focusing in particular on hash join and sort-merge join and the relative tradeoffs of the two approaches.

Read the following paper:

Shapiro, L. D. “Join Processing in Database Systems with Large Main Memories.” In the Red Book.
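As background, a minimal in-memory hash join is sketched below (names and data are illustrative; the partitioned Grace and hybrid variants in the paper handle the case where the build relation exceeds memory):

```python
# Minimal in-memory hash join: build a hash table on the smaller
# relation, then probe it with the larger one. Illustrative only.

from collections import defaultdict

def hash_join(build, probe, build_key, probe_key):
    table = defaultdict(list)
    for row in build:                 # build phase: smaller relation
        table[build_key(row)].append(row)
    out = []
    for row in probe:                 # probe phase: larger relation
        for match in table.get(probe_key(row), []):
            out.append(match + row)
    return out

depts = [(1, "eng"), (2, "sales")]
emps  = [("alice", 1), ("bob", 2), ("carol", 1)]
joined = hash_join(depts, emps, lambda d: d[0], lambda e: e[1])
```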

In this lecture, we will discuss the problem of query optimization, focusing on the algorithms proposed in the classic “Selinger” paper.

Read the following papers:

  • Selinger, Patricia, M. Astrahan, D. Chamberlin, Raymond Lorie, and T. Price. “Access Path Selection in a Relational Database Management System.” Proceedings of ACM SIGMOD (1979): 23-34.
  • Optionally, you may also wish to look at: Mannino, Michael, Paichen Chu, and Thomas Sager. “Statistical Profile Estimation in Database Systems.” ACM Computing Surveys 20, no. 3 (1988): 191-221. This paper discusses many of the techniques used to make query optimization and cost estimation practical in modern database systems. We will cover some of the ideas at a high level in class.

As you read, think about and come to class prepared to answer the following questions:

  • The Selinger paper claims to be ‘optimal’. Under what assumptions is this optimality true? Can you think of a situation in which Selinger will definitely be non-optimal?
  • Query optimization is highly dependent on the effectiveness of cost estimation. The cost metrics that Selinger proposes are very simple; how would you make them more sophisticated? What is the impact of more sophisticated cost metrics on the performance of a database system?
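As a deliberately simplified reference for how the Selinger algorithm searches the plan space, the sketch below enumerates left-deep join orders with dynamic programming over table subsets. The cost model here is invented for illustration (pairwise join cost equals the product of input cardinalities, with a fixed selectivity); the paper uses per-predicate selectivities and access-path costs:

```python
# Toy Selinger-style dynamic programming over left-deep join orders.
# Invented cost model; illustrative only.

from itertools import combinations

def best_join_order(cards, selectivity=0.1):
    """cards: {table_name: cardinality}. Returns (total_cost, order)."""
    tables = frozenset(cards)
    # best[S] = (cost, output_cardinality, order) for joining subset S
    best = {frozenset([t]): (0, cards[t], (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            candidates = []
            for t in subset:              # t is the last table joined in
                rest = subset - {t}
                cost, card, order = best[rest]
                out_card = card * cards[t] * selectivity
                candidates.append(
                    (cost + card * cards[t], out_card, order + (t,)))
            best[subset] = min(candidates)  # keep cheapest plan per subset
    return best[tables][0], best[tables][2]
```

Even this sketch shows why the algorithm joins small relations first, and why its optimality is relative to the cost model and the left-deep search space, which connects to the first question above.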