- Deep RL WS1
- Deep RL WS2
- Deep RL Lec CS294 Berk
- Reinforcement Learning Course Lectures UCL
- RL CS188
- Deep RL (CS231n Lecture)
- Deep RL (CS231n Slides)
- An Outsider’s Tour of Reinforcement Learning (Ben Recht!!!)
- Reinforcement Learning Series (Tutorial + Code (vids))
- A step-by-step Policy Gradient algorithms Colab + Pytorch tutorial
- Pathmind: Reinforcement Learning Simulations (Code)
- Reinforcement Learning Tutorial (Sentdex)
Intro - Reinforcement Learning
Reinforcement Learning:
Asynchronous:
Mathematical Formulation of RL - Markov Decision Processes:
Markov Decision Process: defined by \((\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)\):
- \(\mathcal{S}\): set of possible states
- \(\mathcal{A}\): set of possible actions
- \(\mathcal{R}\): distribution of reward given (state, action) pair
- \(\mathbb{P}\): transition probability i.e. distribution over next state given (state, action) pair
- \(\gamma\): discount factor
MDPs Algorithm/Idea:
- At time step \(\mathrm{t}=0,\) environment samples initial state \(\mathrm{s}_ {0} \sim \mathrm{p}\left(\mathrm{s}_ {0}\right)\)
- Then, for \(\mathrm{t}=0\) until done:
- Agent selects action \(a_t\)
- Environment samples reward \(\mathrm{r}_ {\mathrm{t}} \sim \mathrm{R}\left( . \vert \mathrm{s}_{\mathrm{t}}, \mathrm{a}_ {\mathrm{t}}\right)\)
- Environment samples next state \(\mathrm{s}_ {\mathrm{t}+1} \sim \mathrm{P}\left( . \vert \mathrm{s}_ {\mathrm{t}}, \mathrm{a}_ {\mathrm{t}}\right)\)
- Agent receives reward \(\mathrm{r}_ {\mathrm{t}}\) and next state \(\mathrm{s}_ {\mathrm{t}+1}\)
- A policy \(\pi\) is a function from \(S\) to \(A\) that specifies what action to take in each state
- Objective: find policy \(\pi^{\ast}\) that maximizes cumulative discounted reward:$$\sum_{t \geq 0} \gamma^{t} r_{t}$$
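As a concrete illustration of this loop, here is a minimal sketch with a hypothetical two-state, two-action MDP under a uniform random policy (the transition and reward tables are made up for illustration):

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (numbers are illustrative only).
# P[s][a] = distribution over next states, R[s][a] = reward for (s, a).
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}
gamma = 0.9

rng = np.random.default_rng(0)
s = 0                              # s_0 ~ p(s_0) (fixed here for simplicity)
ret, discount = 0.0, 1.0
for t in range(100):
    a = rng.integers(2)            # agent selects a_t (uniform random policy)
    r = R[s][a]                    # r_t ~ R(. | s_t, a_t) (deterministic here)
    s = rng.choice(2, p=P[s][a])   # s_{t+1} ~ P(. | s_t, a_t)
    ret += discount * r            # accumulate gamma^t * r_t
    discount *= gamma
print("cumulative discounted reward:", ret)
```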
Optimal Policy \(\pi^{\ast}\):
We want to find the optimal policy \(\pi^{\ast}\) that maximizes the sum of rewards.
We handle the randomness (initial state, transition probabilities, etc.) by maximizing the expected sum of rewards.
Formally,$$\pi^{* }=\arg \max _{\pi} \mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | \pi\right] \quad \text{ with } \quad s_{0} \sim p\left(s_{0}\right), \; a_{t} \sim \pi\left(\cdot | s_{t}\right), \; s_{t+1} \sim p\left(\cdot | s_{t}, a_{t}\right)$$
The Bellman Equations:
Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values.
The Bellman Equations characterize the optimal values:$$\begin{aligned} V^{ * }(s) &= \max _{a} Q^{ * }(s, a) \\ Q^{ * }(s, a) &= \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{ * }\left(s^{\prime}\right)\right] \\ V^{ * }(s) &= \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{ * }\left(s^{\prime}\right)\right] \end{aligned}$$
Value Iteration Algorithm:
The Value Iteration algorithm computes the optimal values:$$V_{k+1}(s) \leftarrow \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{k}\left(s^{\prime}\right)\right]$$
- Value iteration is just a fixed-point solution method.
- It repeatedly applies the Bellman equation as an update; for \(\gamma < 1\) this backup is a contraction, so the values converge to the optimal values.
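A minimal NumPy sketch of value iteration using exactly this update; the \(T[s, a, s']\) and \(R[s, a, s']\) arrays below are made-up placeholders for a small tabular MDP:

```python
import numpy as np

# Value iteration on a made-up tabular MDP.
# T[s, a, s'] = transition probability, R[s, a, s'] = reward, gamma = discount.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.random((nS, nA, nS)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))

V = np.zeros(nS)
for _ in range(10_000):
    # Q(s, a) = sum_{s'} T(s, a, s') * [R(s, a, s') + gamma * V(s')]
    Q = (T * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=1)                   # V_{k+1}(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:   # stop once values stop changing
        V = V_new
        break
    V = V_new
greedy_policy = Q.argmax(axis=1)            # extract a greedy policy from V
```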
Issues:
- Problem 1: It’s slow – \(O(S^2A)\) per iteration
- Problem 2: The “max” at each state rarely changes
- Problem 3: The policy often converges long before the values
- Problem 4: Not scalable. Must compute \(Q(s, a)\) for every state-action pair. If state is e.g. current game state pixels, computationally infeasible to compute for entire state space
Policy Iteration:
An alternative approach for computing the optimal values:
Policy Iteration algorithm:
- Step #1 - Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
- Step #2 - Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
- Repeat the two steps until the policy converges
- Evaluation:
For the fixed current policy \(\pi_i\), find the values with policy evaluation, iterating until the values converge:
$$V_{k+1}^{\pi_{i}}(s) \leftarrow \sum_{s^{\prime}} T\left(s, \pi_{i}(s), s^{\prime}\right)\left[R\left(s, \pi_{i}(s), s^{\prime}\right)+\gamma V_{k}^{\pi_{i}}\left(s^{\prime}\right)\right]$$
- Improvement:
For the fixed values, get a better policy using policy extraction (one-step look-ahead):
$$\pi_{i+1}(s)=\arg \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\pi_{i}}\left(s^{\prime}\right)\right]$$
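Putting the two steps together, here is a small NumPy sketch of policy iteration; it assumes the same made-up \(T[s, a, s']\), \(R[s, a, s']\) arrays as the value-iteration sketch above:

```python
import numpy as np

def policy_evaluation(pi, T, R, gamma, tol=1e-10):
    """Iterate the fixed-policy Bellman update until the values converge."""
    nS = T.shape[0]
    V = np.zeros(nS)
    while True:
        V_new = np.array([(T[s, pi[s]] * (R[s, pi[s]] + gamma * V)).sum()
                          for s in range(nS)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(T, R, gamma=0.9):
    """Alternate policy evaluation and one-step look-ahead improvement."""
    nS, nA = T.shape[0], T.shape[1]
    pi = np.zeros(nS, dtype=int)                 # start from an arbitrary policy
    while True:
        V = policy_evaluation(pi, T, R, gamma)   # Step #1: policy evaluation
        Q = (T * (R + gamma * V)).sum(axis=2)    # one-step look-ahead
        pi_new = Q.argmax(axis=1)                # Step #2: policy improvement
        if np.array_equal(pi_new, pi):           # stop when the policy is stable
            return pi, V
        pi = pi_new
```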
Properties:
- It’s still optimal
- Can converge (much) faster under some conditions
Comparison - Value Iteration vs Policy Iteration:
Q-Learning | Solving for Optimal Policy:
A problem with value iteration was that it is not scalable: we must compute \(Q(s, a)\) for every state-action pair.
Q-Learning solves this by using a function approximator to estimate the action-value function:$$Q(s, a ; \theta) \approx Q^{* }(s, a)$$
Deep Q-learning: the case where the function approximator is a deep neural net.
Training:
Experience Replay
Deep Q-learning with Experience Replay - Algorithm:
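The notes only name the algorithm here, so below is a rough PyTorch sketch of deep Q-learning with experience replay: transitions \((s, a, r, s', \text{done})\) collected while acting \(\epsilon\)-greedily are stored in a replay buffer, and random minibatches are sampled from it so that updates are not made on correlated, consecutive samples. The environment interface, network sizes, and hyperparameters are placeholder assumptions, and the separate target network used in the DQN paper is omitted for brevity:

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, EPS = 4, 2, 0.99, 0.1     # illustrative values

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))         # Q(s, . ; theta)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)     # replay buffer of (s, a, r, s2, done) tuples

def act(state):
    # epsilon-greedy action selection from the current Q-network
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32))
        return int(q.argmax().item())

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)            # random minibatch
    s, a, r, s2, done = zip(*batch)
    s, s2 = torch.tensor(s, dtype=torch.float32), torch.tensor(s2, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r, done = torch.tensor(r, dtype=torch.float32), torch.tensor(done, dtype=torch.float32)
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; theta)
    with torch.no_grad():                                # TD target y_t
        target = r + GAMMA * q_net(s2).max(dim=1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q, target)             # regress Q toward y_t
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

During interaction, `act(s)` picks the action, the resulting `(s, a, r, s2, done)` tuple is appended to `replay`, and `train_step()` is called every step or every few steps.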
Policy Gradients:
An alternative to learning a Q-function.
Q-functions can be very complicated. Example: a robot grasping an object has a very high-dimensional state, so it is hard to learn the exact value of every (state, action) pair.
- Define a class of parameterized policies:
$$\Pi=\left\{\pi_{\theta}, \theta \in \mathbb{R}^{m}\right\}$$
- For each policy, define its value:
$$J(\theta)=\mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | \pi_{\theta}\right]$$
- Find the optimal policy \(\theta^{ * }=\arg \max _ {\theta} J(\theta)\) by gradient ascent on policy parameters (REINFORCE Algorithm)
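For concreteness, one common choice of policy class (an assumption here, not something the notes specify) is a small neural network with a softmax over actions, so that \(\pi_{\theta}(a \vert s)\) is a categorical distribution:

```python
import torch
import torch.nn as nn

# A parameterized policy pi_theta(a | s): a small network with a softmax head.
# STATE_DIM and N_ACTIONS are placeholder sizes.
STATE_DIM, N_ACTIONS = 4, 2
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                           nn.Linear(64, N_ACTIONS))

def pi(state):
    """Return the categorical distribution over actions for one state."""
    logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
    return torch.distributions.Categorical(logits=logits)

a = pi([0.1, -0.2, 0.0, 0.3]).sample()   # a_t ~ pi_theta(. | s_t)
```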
REINFORCE Algorithm:
Expected Reward:$$\begin{aligned} J(\theta) &=\mathbb{E}_{\tau \sim p(\tau ; \theta)}[r(\tau)] \\ &=\int_{\tau} r(\tau) p(\tau ; \theta) \mathrm{d} \tau \end{aligned}$$
where \(r(\tau)\) is the reward of a trajectory \(\tau=\left(s_{0}, a_{0}, r_{0}, s_{1}, \dots\right)\).
The Gradient:$$\nabla_{\theta} J(\theta)=\int_{\tau} r(\tau) \nabla_{\theta} p(\tau ; \theta) \mathrm{d} \tau$$
- The gradient is intractable: taking the gradient of an expectation is problematic when \(p\) itself depends on \(\theta\).
- Solution - rewrite the gradient using the following trick:
$$\nabla_{\theta} p(\tau ; \theta)=p(\tau ; \theta) \frac{\nabla_{\theta} p(\tau ; \theta)}{p(\tau ; \theta)}=p(\tau ; \theta) \nabla_{\theta} \log p(\tau ; \theta)$$
- Injecting Back:
$$\begin{aligned} \nabla_{\theta} J(\theta) &=\int_{\tau}\left(r(\tau) \nabla_{\theta} \log p(\tau ; \theta)\right) p(\tau ; \theta) \mathrm{d} \tau \\ &=\mathbb{E}_{\tau \sim p(\tau ; \theta)}\left[r(\tau) \nabla_{\theta} \log p(\tau ; \theta)\right] \end{aligned}$$
- Estimating the Gradient: Can estimate with Monte Carlo sampling.
- The gradient does NOT depend on transition probabilities:
- \[p(\tau ; \theta)=\prod_{t \geq 0} p\left(s_{t+1} | s_{t}, a_{t}\right) \pi_{\theta}\left(a_{t} | s_{t}\right)\]
- \(\log p(\tau ; \theta)=\sum_{t \geq 0} \log p\left(s_{t+1} | s_{t}, a_{t}\right)+\log \pi_{\theta}\left(a_{t} | s_{t}\right)\)
- \(\implies\) \[\nabla_{\theta} \log p(\tau ; \theta)=\sum_{t \geq 0} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\]
- Therefore, when sampling a trajectory \(\tau\), we can estimate \(\nabla_{\theta} J(\theta)\) with:
$$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0} r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$
- Gradient Estimator:
$$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0} r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$
- Intuition/Interpretation:
- If \(r(\tau)\) is high, push up the probabilities of the actions seen
- If \(r(\tau)\) is low, push down the probabilities of the actions seen
It might seem simplistic to say that if a trajectory is good then all of its actions were good; but in expectation, it averages out!
- Variance:
- Issue: This also suffers from high variance because credit assignment is really hard.
- Variance Reduction - Two Ideas:
- Push up probabilities of an action seen, only by the cumulative future reward from that state:
$$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} r_{t^{\prime}}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$
- Use discount factor \(\gamma\) to ignore delayed effects
$$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} \gamma^{t^{\prime}-t} r_{t^{\prime}}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$
- Problem: The raw value of a trajectory isn’t necessarily meaningful. For example, if rewards are all positive, you keep pushing up probabilities of actions.
- What is important then: Whether a reward is better or worse than what you expect to get.
- Solution: Introduce a baseline function dependent on the state.
- Concretely, the estimator is now:$$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} \gamma^{t^{\prime}-t} r_{t^{\prime}}-b\left(s_{t}\right)\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$
- Choosing a Baseline:
- Vanilla REINFORCE:
A simple baseline: a constant moving average of rewards experienced so far from all trajectories (see the REINFORCE sketch after this list).
- Actor-Critic:
We want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.
Intuitively, we are happy with an action \(a_{t}\) in a state \(s_{t}\) if \(Q^{\pi}\left(s_{t}, a_{t}\right)-V^{\pi}\left(s_{t}\right)\) is large. On the contrary, we are unhappy with an action if it’s small.
Now, the estimator:$$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(Q^{\pi_{\theta}}\left(s_{t}, a_{t}\right)-V^{\pi_{\theta}}\left(s_{t}\right)\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$
- Learning \(Q\) and \(V\):
We learn \(Q, V\) using the Actor-Critic Algorithm.
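A compact sketch of the vanilla REINFORCE update with the moving-average baseline described above; it reuses the hypothetical `policy_net` / `pi` from the policy-class sketch earlier, and the learning rate and averaging constant are assumptions:

```python
import torch

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)
GAMMA, baseline = 0.99, 0.0      # baseline b: moving average of observed returns

def reinforce_update(trajectory):
    """trajectory: list of (state, action, reward) tuples from one rollout."""
    global baseline
    states, actions, rewards = zip(*trajectory)
    # discounted future return G_t = sum_{t' >= t} gamma^{t'-t} r_{t'}
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    baseline = 0.9 * baseline + 0.1 * returns.mean().item()   # update moving average
    log_probs = torch.stack([pi(s).log_prob(torch.tensor(a))
                             for s, a in zip(states, actions)])
    # gradient ascent on J(theta): minimize the negative of the estimator
    loss = -((returns - baseline) * log_probs).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```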
Actor-Critic Algorithm:
An algorithm to learn \(Q\) and \(V\).
We can combine Policy Gradients and Q-learning by training both:
- Actor: the policy, and
- Critic: the Q-function
Details:
- The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust
- Also alleviates the task of the critic as it only has to learn the values of (state, action) pairs generated by the policy
- Can also incorporate Q-learning tricks e.g. experience replay
- Remark: the advantage function \(A^{\pi}(s, a)=Q^{\pi}(s, a)-V^{\pi}(s)\) tells us how much better an action was than expected
Algorithm:
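The notes do not reproduce the pseudo-code, so the following is a rough PyTorch sketch of a simple one-step, advantage-based actor-critic update with assumed network shapes and hyperparameters; it is meant to illustrate the actor and critic roles rather than any specific published variant:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99           # placeholder sizes

actor  = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                       nn.Linear(64, N_ACTIONS))   # the policy pi_theta(a | s)
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                       nn.Linear(64, 1))           # state-value estimate V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(s, a, r, s2, done):
    """One update from a single transition (s, a, r, s', done)."""
    s, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in (s, s2))
    v, v2 = critic(s).squeeze(), critic(s2).squeeze()
    # TD target approximates Q(s, a); advantage = Q(s, a) - V(s)
    target = r + GAMMA * v2.detach() * (0.0 if done else 1.0)
    advantage = (target - v).detach()
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss  = -advantage * dist.log_prob(torch.tensor(a))   # policy-gradient step
    critic_loss = (target - v).pow(2)                           # regress V toward target
    loss = actor_loss + critic_loss
    opt.zero_grad(); loss.backward(); opt.step()
```

In practice the actor and critic often share layers, and tricks such as experience replay or multi-step returns can be layered on top, as the notes mention.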
Summary:
- Policy gradients: very general, but suffers from high variance, so it requires a lot of samples. Challenge: sample-efficiency
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration
- Guarantees:
- Policy Gradients: converges to a local optimum of \(J(\theta)\), which is often good enough!
- Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator