Bertsekas = Bertsekas, Dimitri P. Dynamic Programming and Optimal Control. 2 vols. Belmont, MA: Athena Scientific, 2007. ISBN: 9781886529083.
Bertsekas and Tsitsiklis = Bertsekas, Dimitri P., and John N. Tsitsiklis. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996. ISBN: 9781886529106.
LEC # | TOPICS | READINGS |
---|---|---|
1 | Markov Decision Processes; Finite-Horizon Problems: Backwards Induction; Discounted-Cost Problems: Cost-to-Go Function, Bellman’s Equation | Bertsekas Vol. 1, Chapter 1. |
2 | Value Iteration (a code sketch follows the table); Existence and Uniqueness of the Solution to Bellman’s Equation; Gauss-Seidel Value Iteration | Bertsekas Vol. 2, Chapter 1. Bertsekas and Tsitsiklis, Chapter 2. |
3 | Optimality of Policies Derived from the Cost-to-Go Function; Policy Iteration; Asynchronous Policy Iteration | Bertsekas Vol. 2, Chapter 1. Bertsekas and Tsitsiklis, Chapter 2. |
4 | Average-Cost Problems: Relationship with Discounted-Cost Problems; Bellman’s Equation; Blackwell Optimality | Bertsekas Vol. 2, Chapter 4. |
5 | Average-Cost Problems: Computational Methods | Bertsekas Vol. 2, Chapter 4. |
6 | Application of Value Iteration to Optimization of Multiclass Queueing Networks; Introduction to Simulation-Based Methods; Real-Time Value Iteration | Chen, R. R., and S. P. Meyn. “Value Iteration and Optimization of Multiclass Queueing Networks.” Queueing Systems 32 (1999): 65-97. Bertsekas and Tsitsiklis, Chapter 5. |
7 | Q-Learning (a code sketch follows the table); Stochastic Approximations | Bertsekas and Tsitsiklis, Chapters 4 and 5. |
8 | Stochastic Approximations: Lyapunov Function Analysis; The ODE Method; Convergence of Q-Learning | Bertsekas and Tsitsiklis, Chapters 4 and 5. |
9 | Exploration versus Exploitation: The Complexity of Reinforcement Learning | Kearns, M., and S. Singh. “Near-Optimal Reinforcement Learning in Polynomial Time.” Machine Learning 49, no. 2 (Nov 2002): 209-232. |
10 | Introduction to Value Function Approximation; Curse of Dimensionality; Approximation Architectures | Bertsekas and Tsitsiklis, Chapter 6. |
11 | Model Selection and Complexity | Hastie, Tibshirani, and Friedman. Chapter 7 in The Elements of Statistical Learning. New York: Springer, 2003. ISBN: 9780387952840. |
12 | Introduction to Value Function Approximation Algorithms; Performance Bounds | Bertsekas and Tsitsiklis, Chapter 6. |
13 | Temporal-Difference Learning with Value Function Approximation (a code sketch follows the table) | Bertsekas and Tsitsiklis, Chapter 6. |
14 | Temporal-Difference Learning with Value Function Approximation (cont.) | Bertsekas and Tsitsiklis, Chapter 6. de Farias, D. P., and B. Van Roy. “On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning.” |
15 | Temporal-Difference Learning with Value Function Approximation (cont.); Optimal Stopping Problems; General Control Problems | Bertsekas and Tsitsiklis, Chapter 6. de Farias, D. P., and B. Van Roy. “On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning.” Bertsekas, Borkar, and Nedic. “Improved Temporal Difference Methods with Linear Function Approximation.” |
16 | Approximate Linear Programming | de Farias, D. P., and B. Van Roy. “The Linear Programming Approach to Approximate Dynamic Programming.” |
17 | Approximate Linear Programming (cont.) | de Farias, D. P., and B. Van Roy. “The Linear Programming Approach to Approximate Dynamic Programming.” |
18 | Efficient Solutions for Approximate Linear Programming | de Farias, D. P., and B. Van Roy. “On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming.” Calafiore, and Campi. “Uncertain Convex Programs: Randomized Solutions and Confidence Levels.” |
19 | Efficient Solutions for Approximate Linear Programming: Factored MDPs | Guestrin, et al. “Efficient Solution Algorithms for Factored MDPs.” Schuurmans, and Patrascu. “Direct Value Approximation for Factored MDPs.” |
20 | Policy Search Methods | Marbach, and Tsitsiklis. “Simulation-Based Optimization of Markov Reward Processes.” |
21 | Policy Search Methods (cont.) | Baxter, and Bartlett. “Infinite-Horizon Policy-Gradient Estimation.” |
22 | Policy Search Methods for POMDPs; Application: Call Admission Control; Actor-Critic Methods | Baxter, and Bartlett. “Infinite-Horizon Policy-Gradient Estimation.” Baxter, and Bartlett. “Experiments with Infinite-Horizon Policy-Gradient Estimation.” Konda, and Tsitsiklis. “Actor-Critic Algorithms.” |
23 | Guest Lecture: Prof. Nick Roy; Approximate POMDP Compression | Roy, and Gordon. “Exponential Family PCA for Belief Compression in POMDPs.” |
24 | Policy Search Methods: PEGASUS; Application: Helicopter Control | Ng, and Jordan. “PEGASUS: A Policy Search Method for Large MDPs and POMDPs.” Ng, et al. “Autonomous Helicopter Flight via Reinforcement Learning.” |
Complementary Reading
Even-Dar, and Mansour. “Learning Rates for Q-Learning.” Journal of Machine Learning Research 5 (2003): 1-25.
Barron. “Universal Approximation Bounds for Superpositions of a Sigmoidal Function.” IEEE Transactions on Information Theory 39 (1993): 930-944.
Tesauro. “Temporal-Difference Learning and TD-Gammon.” Communications of the ACM 38, no. 3 (1995).