## Texts

### Required

The primary source will be the book *Distributed Algorithms* by Prof. Nancy Lynch.

[Lynch] = Lynch, Nancy. *Distributed Algorithms*. Burlington, MA: Morgan Kaufmann, 1996. ISBN: 9781558603486.

Errata (PDF)

This book has gone through many printings, but we have made no changes since the fourth printing, so fourth printings or later are just fine. Known errata for early printings are collected in errata lists. The book refers to many papers from the research literature on distributed algorithms; you might want to track down and read some of these.

### Recommended

Other books that you will find useful are:

[Attiya and Welch] = Attiya, Hagit, and Jennifer Welch. *Distributed Computing: Fundamentals, Simulations, and Advanced Topics*. 2nd ed. New York, NY: Wiley-Interscience, 2004. ISBN: 9780471453246.

This is another textbook on distributed algorithms, initially published a little after the Lynch book. It now has a second edition. The material covered overlaps quite a lot with the Lynch book, though Attiya and Welch do cover some topics, like clock synchronization, that Lynch does not cover. The style is a little less formal.

[Herlihy and Shavit] = Herlihy, Maurice, and Nir Shavit. *The Art of Multiprocessor Programming*. Burlington, MA: Morgan Kaufmann, 2008. ISBN: 9780123705914.

This undergraduate textbook covers basic algorithms and techniques for multiprocessor programming.

[Guerraoui and Kapalka] = Guerraoui, Rachid, and Michal Kapalka. *Transactional Memory: The Theory*. San Rafael, CA: Morgan and Claypool, 2010. ISBN: 9781608450114.

This monograph presents basic theoretical results about the possibility and costs of implementing transactional memory for shared-memory multiprocessors.

[Dolev] = Dolev, Shlomi. *Self-Stabilization*. Cambridge, MA: MIT Press, 2000. ISBN: 9780262041782.

This book gives a good description of self-stabilizing distributed algorithms. Self-stabilization is a strong kind of fault-tolerance, which we will study near the end of the course.

Kaynar, Dilsun, Nancy Lynch, Roberto Segala, and Frits Vaandrager. *The Theory of Timed I/O Automata*. 2nd ed. San Rafael, CA: Morgan and Claypool, 2010. ISBN: 9781608450022.

This monograph presents a basic modeling framework, Timed I/O Automata (TIOA), for describing and analyzing distributed algorithms.

In addition, some research papers that are not covered in the textbook will be covered in class and on problem sets. These papers are listed in the supplementary readings.

## Reading by Session

SES # | TOPICS | READINGS |
---|---|---|

1 | Course overview. Synchronous networks. Leader election in synchronous ring networks. | [Lynch] Chapters 1 and 2, sections 3.1 to 3.5. |

2 | Leader election in rings. Basic computational tasks in general synchronous networks: leader election. Breadth-first search. Broadcast and convergecast. Shortest paths. | [Lynch] Section 3.6, and sections 4.1 to 4.3. |

3 | Spanning trees. Shortest paths (Bellman-Ford). Minimum spanning trees. Maximal independent sets (summary). | [Lynch] Sections 4.3 to 4.5 (skip 4.5.3). |

4 | Fault-tolerant consensus. Link failures: the two generals problem. Process failures (stopping, Byzantine). Algorithms for agreement with stopping and Byzantine failures. Exponential information gathering. | [Lynch] Sections 5.1 and 6.1 to 6.3. |

5 | Number-of-processor bounds for Byzantine agreement. Weak Byzantine agreement. Time bounds for consensus problems. | [Lynch] Sections 6.4 to 6.7. Aguilera, Marcos Kawazoe, and Sam Toueg. "A Simple Bivalency Proof that t-Resilient Consensus Requires t+1 Rounds." Optional: Keidar, Idit, and Sergio Rajsbaum. "A Simple Proof of the Uniform Consensus Synchronous Lower Bound." |

6 | k-set-agreement. Approximate agreement. Distributed commit. | [Lynch] Chapter 7. |

7 | Asynchronous distributed computing. Formal modeling of asynchronous systems using interacting state machines (I/O automata). Proving correctness of distributed algorithms. | [Lynch] Chapter 8. |

8 | Non-fault-tolerant algorithms for asynchronous networks. Leader election, breadth-first search, shortest paths, broadcast and convergecast. | [Lynch] Chapters 14 and 15. |

9 | Spanning trees. Gallager et al. minimum spanning trees. | [Lynch] Sections 15.3 to 15.5. Gallager, R. G., P. A. Humblet, and P. M. Spira. "A Distributed Algorithm for Minimum-Weight Spanning Trees." |

10 | Synchronizers. Synchronizer applications. Synchronous vs. asynchronous distributed systems. | [Lynch] Chapter 16. |

11 | Time, clocks, and the ordering of events. State-machine simulation. Vector timestamps. | [Lynch] Chapter 18. Lamport, Leslie. "Time, Clocks, and the Ordering of Events in a Distributed System." Mattern, Friedemann. "Virtual Time and Global States of Distributed Systems." |

12 | Stable property detection. Distributed termination. Global snapshots. Deadlock detection. | [Lynch] Chapter 19. |

13 | Asynchronous shared-memory systems. The mutual exclusion problem. Mutual exclusion algorithms: Dijkstra’s algorithm, Peterson’s algorithm, and Lamport’s Bakery algorithm. | [Lynch] Chapter 9 and sections 10.1 to 10.7. |

14 | More mutual exclusion algorithms. Bounds on shared memory for mutual exclusion. Resource allocation. The Dining Philosophers problem. | [Lynch] Sections 10.6 to 10.9. |

15 | Shared-memory multiprocessors. Contention, caching, locality. Practical mutual exclusion algorithms. Reading/writing locks. | [Herlihy and Shavit] Chapter 7. [Lynch] Chapter 11. Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." Optional: Magnusson, Peter, Anders Landin, and Erik Hagersten. "Queue Locks on Cache Coherent Multiprocessors." |

16 | Impossibility of consensus in asynchronous, fault-prone, shared-memory systems. | [Lynch] Chapters 11 and 12. Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. "Impossibility of Distributed Consensus with One Faulty Process." |

17 | Atomic objects. | [Lynch] Chapter 13 (sections 13.1 and 13.2). |

18 | Atomic snapshot algorithms. Atomic read/write register algorithms. | [Lynch] Chapter 13 (sections 13.3 and 13.4). |

19 | List algorithms: locking algorithms, optimistic algorithms, lock-free algorithms, lazy algorithms. | [Herlihy and Shavit] Chapter 9. |

20 | Transactional memory: obstruction-free and lock-based implementations. | [Herlihy and Shavit] Chapter 18. [Guerraoui and Kapalka] Chapters 1 to 4. |

21 | Wait-free computability. The wait-free consensus hierarchy. | Herlihy, Maurice. "Wait-Free Synchronization." [Attiya and Welch] Chapter 15. |

22 | Wait-free vs. f-fault-tolerant atomic objects. Boosting fault-tolerance. | Borowsky, Elizabeth, Eli Gafni, Nancy Lynch, and Sergio Rajsbaum. "The BG Distributed Simulation Algorithm." Attie, Paul, Rachid Guerraoui, Petr Kouznetsov, Nancy Lynch, and Sergio Rajsbaum. "The Impossibility of Boosting Distributed Service Resilience." Submitted for journal publication, September 2009. Chandra, Tushar D., Vassos Hadzilacos, Prasad Jayanti, and Sam Toueg. "Generalized Irreducibility of Consensus and the Equivalence of t-Resilient and Wait-Free Implementations of Consensus." |

23 | Asynchronous network model vs. asynchronous shared-memory model. Impossibility of consensus in asynchronous networks. Failure detectors and consensus. Paxos consensus algorithm. | [Lynch] Chapter 17. Lamport, Leslie. "The Part-Time Parliament." |

24 | Self-stabilizing algorithms. | [Dolev] Chapter 2. |

25 | Timing-based systems. Modeling and verification. Timing-based algorithms for mutual exclusion and consensus. Clock synchronization. | [Lynch] Chapters 23 to 25. [Attiya and Welch] Section 6.3 and chapter 13. |

## Supplementary Readings

TOPICS | READINGS |
---|---|

Dijkstra Prize papers | In each of the years 2000-2009, a prize has been awarded to a research paper that has had a strong impact on research in the area of distributed algorithms. The prize was originally called the "PODC Influential Paper Award." After the death of Edsger Dijkstra, one of the pioneers of the field, in August 2002, the prize was renamed the "Dijkstra Prize." We will study the key contributions of most of these papers during this semester. In case you want to read the original papers for yourselves, here is a list: Lamport, Leslie. "Time, Clocks, and the Ordering of Events in a Distributed System." Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. "Impossibility of Distributed Consensus with One Faulty Process." Dijkstra, Edsger W. "Self-Stabilizing Systems in Spite of Distributed Control." Herlihy, Maurice. "Wait-Free Synchronization." Gallager, R. G., P. A. Humblet, and P. M. Spira. "A Distributed Algorithm for Minimum-Weight Spanning Trees." Pease, M., R. Shostak, and L. Lamport. "Reaching Agreement in the Presence of Faults." Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." Dwork, Cynthia, Nancy Lynch, and Larry Stockmeyer. "Consensus in the Presence of Partial Synchrony." Awerbuch, Baruch, and David Peleg. "Sparse Partitions." Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, Missouri, October, 1990, pp. 503-513. Halpern, Joseph, and Yoram Moses. "Knowledge and Common Knowledge in a Distributed Environment." |

Synchronous networks | Aguilera, Marcos Kawazoe, and Sam Toueg. "A Simple Bivalency Proof that t-Resilient Consensus Requires t+1 Rounds." Keidar, Idit, and Sergio Rajsbaum. "A Simple Proof of the Uniform Consensus Synchronous Lower Bound." |

Asynchronous networks | Mattern, Friedemann. "Virtual Time and Global States of Distributed Systems." Fidge, Colin. "Logical Time in Distributed Computing Systems." Chaudhuri, Soma. "More Choices Allow More Faults: Set Consensus Problems in Totally Asynchronous Systems." |

Asynchronous shared memory: mutual exclusion | The following paper and thesis chapter present a new, fundamental lower bound for the time required to achieve mutual exclusion. Fan, Rui, and Nancy Lynch. "An Ω(n log n) Lower Bound on the Cost of Mutual Exclusion." Fan, Rui. "Mutual Exclusion." Chapter 4 of the author's thesis. |

Asynchronous shared memory: wait-free computability and the wait-free consensus hierarchy | This paper popularized the notion of wait-free computability, and also introduced the wait-free consensus hierarchy: Herlihy, Maurice. "Wait-Free Synchronization." This paper presents an interesting observation about the wait-free consensus hierarchy: Jayanti, Prasad. "Robust Wait-Free Hierarchies." |

Asynchronous shared memory: wait-free vs. f-fault-tolerant data objects | Chandra, Tushar D., Vassos Hadzilacos, Prasad Jayanti, and Sam Toueg. "Generalized Irreducibility of Consensus and the Equivalence of t-Resilient and Wait-Free Implementations of Consensus." Borowsky, Elizabeth, Eli Gafni, Nancy Lynch, and Sergio Rajsbaum. "The BG Distributed Simulation Algorithm." Attie, Paul, Rachid Guerraoui, Petr Kouznetsov, Nancy Lynch, and Sergio Rajsbaum. "The Impossibility of Boosting Distributed Service Resilience." Submitted for journal publication, September 2009. |

Asynchronous shared memory: failure detectors, consensus, and set consensus | The idea of a "failure detector" was introduced in the following paper, which also shows how certain failure detectors can be used to solve consensus: Chandra, Tushar Deepak, and Sam Toueg. "Unreliable Failure Detectors for Reliable Distributed Systems." Lamport's "Paxos" paper solves consensus, essentially assuming an underlying failure detector service: Lamport, Leslie. "The Part-Time Parliament." The following paper proves that a certain failure detector is the "weakest" for solving consensus: Chandra, Tushar Deepak, Vassos Hadzilacos, and Sam Toueg. "The Weakest Failure Detector for Solving Consensus." There are several new papers on failure detectors for set consensus, starting with this one: Guerraoui, Rachid, Maurice Herlihy, Petr Kouznetsov, Nancy Lynch, and Calvin Newport. "On the Weakest Failure Detector Ever." |

Multiprocessor programming | The brand-new Herlihy-Shavit textbook covers the principles of multiprocessor programming quite thoroughly: Herlihy, Maurice, and Nir Shavit. *The Art of Multiprocessor Programming*. This paper introduced the MCS (Mellor-Crummey and Scott) queue lock, which is fast, scalable, and fair in a wide variety of multiprocessor systems: Mellor-Crummey, John M., and Michael L. Scott. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." Magnusson, Peter, Anders Landin, and Erik Hagersten. "Queue Locks on Cache Coherent Multiprocessors." The following monograph draft describes theoretical aspects of transactional memory: Guerraoui, Rachid, and Michal Kapalka. *Transactional Memory: The Theory*. |

Self-stabilization | Dijkstra's breakthrough paper originated the idea of self-stabilization for distributed algorithms: Dijkstra, Edsger W. "Self-Stabilizing Systems in Spite of Distributed Control." The Dolev book contains everything you might want to know about basic self-stabilizing distributed algorithms: Dolev, Shlomi. *Self-Stabilization*. |

Timed systems | The Attiya-Welch book contains a chapter on basic clock synchronization algorithms: Attiya, Hagit, and Jennifer Welch. *Distributed Computing: Fundamentals, Simulations, and Advanced Topics*. The following paper and thesis chapter contain a lower bound on "gradient" clock synchronization. The thesis chapter is more up-to-date than the journal paper. Fan, Rui, and Nancy Lynch. "Gradient Clock Synchronization." Fan, Rui. "Gradient Clock Lower Bound." Chapter 2 of the author's thesis. The following monograph contains basic formal modeling and proof methods for timing-based systems. It provides the mathematical foundation for the Tempo toolset that we will use in this course: Kaynar, Dilsun, Nancy Lynch, Roberto Segala, and Frits Vaandrager. *The Theory of Timed I/O Automata*. |