2.997 | Spring 2004 | Graduate

Decision Making in Large Scale Systems

Readings

Bertsekas = Bertsekas, Dimitri P. Dynamic Programming and Optimal Control. 2 vols. Belmont, MA: Athena Scientific, 2007. ISBN: 9781886529083.

Bertsekas and Tsitsiklis = Bertsekas, Dimitri P., and John N. Tsitsiklis. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996. ISBN: 9781886529106.

LEC # TOPICS READINGS
1

Markov Decision Processes

Finite-Horizon Problems: Backwards Induction

Discounted-Cost Problems: Cost-to-Go Function, Bellman’s Equation

Bertsekas Vol. 1, Chapter 1.
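
For reference, the discounted-cost Bellman equation that this lecture builds toward, stated in standard notation (a textbook identity, with symbols of my choosing rather than a quotation from Bertsekas):

    J^*(x) = \min_{u \in U(x)} \Big[ g(x,u) + \alpha \sum_{y} p_{xy}(u) \, J^*(y) \Big], \qquad 0 < \alpha < 1,

where J^* is the optimal cost-to-go function, g(x,u) the stage cost, p_{xy}(u) the transition probabilities under control u, and \alpha the discount factor.
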
2

Value Iteration

Existence and Uniqueness of Bellman’s Equation Solution

Gauss-Seidel Value Iteration

Bertsekas Vol. 2, Chapter 1.

Bertsekas and Tsitsiklis, Chapter 2.
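
As a concrete companion to the readings, here is a minimal sketch of synchronous value iteration for a small finite MDP; the array conventions (P as a list of per-action transition matrices, g as a state-by-action cost array) and all names are illustrative assumptions, not code from the texts.

    import numpy as np

    def value_iteration(P, g, alpha, tol=1e-8):
        # P[u]: (n x n) transition matrix for action u; g[x, u]: stage cost
        # alpha: discount factor in (0, 1)
        n, m = g.shape
        J = np.zeros(n)
        while True:
            # Bellman operator: (TJ)(x) = min_u [ g(x,u) + alpha * E[J(next state)] ]
            Q = np.stack([g[:, u] + alpha * P[u] @ J for u in range(m)], axis=1)
            J_new = Q.min(axis=1)
            if np.abs(J_new - J).max() < tol:
                return J_new, Q.argmin(axis=1)  # cost-to-go estimate and greedy policy
            J = J_new

The Gauss-Seidel variant discussed in lecture differs only in updating one state at a time within a sweep, reusing the freshest values.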

3

Optimality of Policies Derived from the Cost-to-Go Function

Policy Iteration

Asynchronous Policy Iteration

Bertsekas Vol. 2, Chapter 1.

Bertsekas and Tsitsiklis, Chapter 2.
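
Under the same assumed conventions as the value-iteration sketch above, a sketch of exact policy iteration; policy evaluation here solves a linear system directly, which is only feasible for small state spaces.

    import numpy as np

    def policy_iteration(P, g, alpha):
        n, m = g.shape
        mu = np.zeros(n, dtype=int)  # start from an arbitrary stationary policy
        while True:
            # Policy evaluation: J_mu solves (I - alpha * P_mu) J = g_mu
            P_mu = np.stack([P[mu[x]][x] for x in range(n)])
            g_mu = g[np.arange(n), mu]
            J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
            # Policy improvement: act greedily with respect to J_mu
            Q = np.stack([g[:, u] + alpha * P[u] @ J for u in range(m)], axis=1)
            mu_new = Q.argmin(axis=1)
            if np.array_equal(mu_new, mu):
                return J, mu  # no strict improvement: mu is optimal
            mu = mu_new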

4

Average-Cost Problems

Relationship with Discounted-Cost Problems

Bellman’s Equation

Blackwell Optimality

Bertsekas Vol. 2, Chapter 4.
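
For reference, the average-cost Bellman equation in its standard form (my notation): find a scalar \lambda (the optimal average cost) and a differential cost function h such that

    \lambda + h(x) = \min_{u \in U(x)} \Big[ g(x,u) + \sum_{y} p_{xy}(u) \, h(y) \Big] \quad \text{for all } x,

where h is determined only up to an additive constant. For the well-behaved problem classes treated in the readings, \lambda = \lim_{\alpha \to 1} (1-\alpha) J^*_\alpha(x), which is the link to discounted-cost problems.
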
5

Average-Cost Problems

Computational Methods

Bertsekas Vol. 2, Chapter 4.
6

Application of Value Iteration to Optimization of Multiclass Queueing Networks

Introduction to Simulation-based Methods

Real-Time Value Iteration

Chen, R. R., and S. P. Meyn. “Value Iteration and Optimization of Multiclass Queueing Networks.” Queueing Systems 32 (1999): 65-97.

Bertsekas and Tsitsiklis, Chapter 5.

7

Q-Learning

Stochastic Approximations

Bertsekas and Tsitsiklis, Chapters 4 and 5.
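
To fix ideas, a sketch of the tabular Q-learning update in its usual stochastic-approximation form (variable names are mine; the convergence analysis is the subject of Lecture 8):

    def q_learning_update(Q, x, u, cost, y, alpha, step):
        # Q: (n x m) array of state-action cost estimates
        # (x, u, cost, y): one observed transition; alpha: discount; step: step size
        target = cost + alpha * Q[y].min()    # sampled Bellman backup
        Q[x, u] += step * (target - Q[x, u])  # move the estimate toward the target

With step sizes satisfying the usual Robbins-Monro conditions and every state-action pair visited infinitely often, the iterates converge to the optimal Q-function.
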
8

Stochastic Approximations: Lyapunov Function Analysis

The ODE Method

Convergence of Q-Learning

Bertsekas and Tsitsiklis, Chapters 4 and 5.
9

Exploration versus Exploitation: The Complexity of Reinforcement Learning

Kearns, M., and S. Singh. “Near-Optimal Reinforcement Learning in Polynomial Time.” Machine Learning 49, no. 2 (Nov 2002): 209-232.
10

Introduction to Value Function Approximation

Curse of Dimensionality

Approximation Architectures

Bertsekas and Tsitsiklis, Chapter 6.
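
A typical instance of the approximation architectures discussed here is the linear one (standard notation, not a quotation): approximate the cost-to-go by

    \tilde{J}(x, r) = \sum_{k=1}^{K} r_k \, \phi_k(x) = (\Phi r)(x),

where \phi_1, \ldots, \phi_K are fixed basis functions (features) and r \in \mathbb{R}^K is a weight vector, so that K parameters are stored instead of one value per state; this is the architecture assumed in the temporal-difference and linear-programming lectures that follow.
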
11

Model Selection and Complexity

Hastie, Tibshirani, and Friedman. Chapter 7 in The Elements of Statistical Learning. New York: Springer, 2003. ISBN: 9780387952840.
12

Introduction to Value Function Approximation Algorithms

Performance Bounds

Bertsekas and Tsitsiklis, Chapter 6.
13

Temporal-Difference Learning with Value Function Approximation

Bertsekas and Tsitsiklis, Chapter 6.
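
One concrete form of the algorithm treated in Lectures 13-15 is TD(\lambda) with a linear architecture; the following sketch uses conventions of my choosing (theta for the weight vector, z for the eligibility trace), not code from the text.

    import numpy as np

    def td_lambda_update(theta, z, phi_x, phi_y, cost, alpha, lam, step):
        # Linear approximation J(x) ~ phi(x) @ theta; one transition x -> y observed
        delta = cost + alpha * phi_y @ theta - phi_x @ theta  # temporal difference
        z = alpha * lam * z + phi_x                           # eligibility trace update
        theta = theta + step * delta * z                      # step along the trace
        return theta, z
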
14

Temporal-Difference Learning with Value Function Approximation (cont.)

Bertsekas and Tsitsiklis, Chapter 6.

de Farias, D. P., and B. Van Roy. “On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning.”

15

Temporal-Difference Learning with Value Function Approximation (cont.)

Optimal Stopping Problems

General Control Problems

Bertsekas and Tsitsiklis, Chapter 6.

de Farias, D. P., and B. Van Roy. “On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning.”

Bertsekas, Borkar, and Nedic. “Improved Temporal Difference Methods with Linear Function Approximation.”

16

Approximate Linear Programming

de Farias, D. P., and B. Van Roy. “The Linear Programming Approach to Approximate Dynamic Programming.”
17

Approximate Linear Programming (cont.)

de Farias, D. P., and B. Van Roy. “The Linear Programming Approach to Approximate Dynamic Programming.”
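
The approximate LP studied in these two lectures takes the following standard form (paraphrased in my notation, with c a vector of state-relevance weights and \Phi the matrix of basis functions):

    \max_{r} \; c^\top \Phi r \quad \text{subject to} \quad (\Phi r)(x) \le g(x,u) + \alpha \sum_{y} p_{xy}(u) \, (\Phi r)(y) \quad \text{for all } (x,u),

i.e., the exact LP characterization of Bellman's equation with J replaced by \Phi r. There is one constraint per state-action pair, which is what motivates the constraint-sampling methods of Lecture 18.
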
18

Efficient Solutions for Approximate Linear Programming

de Farias, D. P., and B. Van Roy. “On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming.”

Calafiore, and Campi. “Uncertain Convex Programs: Randomized Solutions and Confidence Levels.”

19

Efficient Solutions for Approximate Linear Programming: Factored MDPs

Guestrin, et al. “Efficient Solution Algorithms for Factored MDPs.”

Schuurmans, and Patrascu. “Direct Value Approximation for Factored MDPs.”

20

Policy Search Methods

Marbach, and Tsitsiklis. “Simulation-Based Optimization of Markov Reward Processes.”
21

Policy Search Methods (cont.)

Baxter, and Bartlett. “Infinite-Horizon Policy-Gradient Estimation.”
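
The common thread of Lectures 20-21 is a likelihood-ratio gradient estimate; one standard statement (my notation, not quoted from the papers) is that for a randomized policy \pi_\theta(u \mid x) with average reward \eta(\theta),

    \nabla \eta(\theta) = \sum_{x} d^{\theta}(x) \sum_{u} \nabla_\theta \pi_\theta(u \mid x) \, q^{\theta}(x,u),

where d^{\theta} is the stationary state distribution under \pi_\theta and q^{\theta} a differential state-action value function; the papers develop single-trajectory simulation estimates of this quantity.
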
22

Policy Search Methods for POMDPs

Application: Call Admission Control

Actor-Critic Methods

Baxter, and Bartlett. “Infinite-Horizon Policy-Gradient Estimation.”

Baxter, and Bartlett. “Experiments with Infinite-Horizon Policy-Gradient Estimation.”

Konda, and Tsitsiklis. “Actor-Critic Algorithms.”

23

Guest Lecture: Prof. Nick Roy

Approximate POMDP Compression

Roy, and Gordon. “Exponential Family PCA for Belief Compression in POMDPs.”
24

Policy Search Methods: PEGASUS

Application: Helicopter Control

Ng, and Jordan. “PEGASUS: A policy search method for large MDPs and POMDPs.”

Ng, et al. “Autonomous Helicopter Flight via Reinforcement Learning.”

Complementary Reading

Even-Dar, and Mansour. “Learning Rates for Q-Learning.” Journal of Machine Learning Research 5 (2003): 1-25.

Barron. “Universal Approximation Bounds for Superpositions of a Sigmoidal Function.” IEEE Transactions on Information Theory 39 (1993): 930-944.

Tesauro. “Temporal-Difference Learning and TD-Gammon.” Communications of the ACM 38, no. 3 (1995).
