Table of Contents



Deep RL WS1
Deep RL WS2
Deep RL Lec CS294 Berk
Reinforcement Learning Course Lectures UCL
RL CS188
Deep RL (CS231n Lecture)
Deep RL (CS231n Slides)

Intro - Reinforcement Learning

  1. Reinforcement Learning:

    img

  2. Asynchronous:

  3. Mathematical Formulation of RL - Markov Decision Processes:
    Markov Decision Process

    Defined by \((\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)\):

    • \(\mathcal{S}\): set of possible states
    • \(\mathcal{A}\): set of possible actions
    • \(\mathcal{R}\): distribution of reward given (state, action) pair
    • \(\mathbb{P}\): transition probability i.e. distribution over next state given (state, action) pair
    • \(\gamma\): discount factor

    MDPs Algorithm/Idea:

    • At time step \(t=0\), environment samples initial state \(s_{0} \sim p\left(s_{0}\right)\)
    • Then, for \(t=0\) until done:
      • Agent selects action \(a_{t}\)
      • Environment samples reward \(r_{t} \sim R\left(\cdot | s_{t}, a_{t}\right)\)
      • Environment samples next state \(s_{t+1} \sim P\left(\cdot | s_{t}, a_{t}\right)\)
      • Agent receives reward \(r_{t}\) and next state \(s_{t+1}\)

    - A policy \(\pi\) is a function from \(S\) to \(A\) that specifies what action to take in each state
    - Objective: find policy \(\pi^{\ast}\) that maximizes cumulative discounted reward:

    $$\sum_{t \geq 0} \gamma^{t} r_{t}$$
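
    To make the loop above concrete, here is a minimal Python sketch of the agent-environment interaction and of the discounted return \(\sum_{t \geq 0} \gamma^{t} r_{t}\). The `ToyMDP` dynamics and the random behaviour policy are placeholder assumptions for illustration, not part of the lecture.

    ```python
    import random

    class ToyMDP:
        """A tiny illustrative MDP (states and dynamics are assumptions of this sketch)."""
        def __init__(self, n_states=4, n_actions=2, gamma=0.9):
            self.n_states, self.n_actions, self.gamma = n_states, n_actions, gamma

        def reset(self):
            # environment samples an initial state s_0 ~ p(s_0)
            return random.randrange(self.n_states)

        def step(self, state, action):
            # environment samples reward r_t ~ R(. | s_t, a_t) and next state s_{t+1} ~ P(. | s_t, a_t)
            reward = random.gauss(float(action == state % self.n_actions), 0.1)
            next_state = random.randrange(self.n_states)
            done = random.random() < 0.05
            return next_state, reward, done

    env = ToyMDP()
    state = env.reset()
    discounted_return, t, done = 0.0, 0, False
    while not done:
        action = random.randrange(env.n_actions)        # agent selects a_t (here: a random policy)
        state, reward, done = env.step(state, action)   # agent receives r_t and s_{t+1}
        discounted_return += env.gamma ** t * reward    # accumulate sum_t gamma^t r_t
        t += 1
    print(discounted_return)
    ```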

    Optimal Policy \(\pi^{\ast}\):
    We want to find the optimal policy \(\pi^{\ast}\) that maximizes the sum of rewards.
    We handle the randomness (initial state, transition probabilities, …) by maximizing the expected sum of rewards.
    Formally,

    $$\pi^{\ast}=\arg \max _{\pi} \mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | \pi\right] \quad \text { with } \quad s_{0} \sim p\left(s_{0}\right), a_{t} \sim \pi\left(\cdot | s_{t}\right), s_{t+1} \sim p\left(\cdot | s_{t}, a_{t}\right)$$

    The Bellman Equations:
    Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values.
    The Bellman Equations characterize optimal values:

    $$\begin{aligned} V^{\ast}(s) &= \max _{a} Q^{\ast}(s, a) \\ Q^{\ast}(s, a) &= \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\ast}\left(s^{\prime}\right)\right] \\ V^{\ast}(s) &= \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\ast}\left(s^{\prime}\right)\right] \end{aligned}$$

    Value Iteration Algorithm:
    The Value Iteration algorithm computes the optimal values:

    $$V_{k+1}(s) \leftarrow \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{k}\left(s^{\prime}\right)\right]$$

    - Value iteration is just a fixed-point solution method.
    - It repeatedly applies the Bellman update until the values converge.
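
    As a concrete illustration, the sketch below runs this update on a small tabular MDP until the values stop changing; the transition tensor `T[s, a, s']` and reward table `R[s, a, s']` are randomly generated placeholders, not from the lecture.

    ```python
    import numpy as np

    n_states, n_actions, gamma = 5, 2, 0.9
    rng = np.random.default_rng(0)

    # Assumed toy dynamics: T[s, a, s'] is a transition distribution, R[s, a, s'] a reward table.
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions, n_states))

    V = np.zeros(n_states)
    for _ in range(1000):
        # V_{k+1}(s) = max_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V_k(s')]
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)   # Q_k(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:                   # fixed point reached
            break
        V = V_new

    optimal_policy = Q.argmax(axis=1)   # greedy policy extracted from the converged values
    ```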

    Convergence:
    Value iteration converges: the Bellman update is a \(\gamma\)-contraction, so the values \(V_{k}\) approach the unique optimal values \(V^{\ast}\) (for \(\gamma<1\)).

    Issues:

    • Problem 1: It’s slow – \(O(S^2A)\) per iteration
    • Problem 2: The “max” at each state rarely changes
    • Problem 3: The policy often converges long before the values
    • Problem 4: Not scalable. Must compute \(Q(s, a)\) for every state-action pair. If the state is, e.g., the raw pixels of the current game screen, it is computationally infeasible to compute values for the entire state space

    Policy Iteration:
    It is an alternative approach for computing optimal values:
    Policy Iteration algorithm:

    • Step #1 - Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
    • Step #2 - Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
    • Repeat steps until the policy converges

    • Evaluation:
      For fixed current policy \(\pi\), find values with policy evaluation:
      • Iterate until values converge:

        $$V_{k+1}^{\pi_{i}}(s) \leftarrow \sum_{s^{\prime}} T\left(s, \pi_{i}(s), s^{\prime}\right)\left[R\left(s, \pi_{i}(s), s^{\prime}\right)+\gamma V_{k}^{\pi_{i}}\left(s^{\prime}\right)\right]$$

    • Improvement:
      For fixed values, get a better policy using policy extraction:
      • One-step look-ahead:

        $$\pi_{i+1}(s)=\arg \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\pi_{i}}\left(s^{\prime}\right)\right]$$
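
    A compact sketch of these two steps on the same kind of tabular toy MDP (again, `T` and `R` are assumed placeholder arrays). Note that each evaluation sweep has no max over actions, which is what can make policy iteration cheaper per sweep than value iteration.

    ```python
    import numpy as np

    n_states, n_actions, gamma = 5, 2, 0.9
    rng = np.random.default_rng(1)
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions, n_states))

    policy = np.zeros(n_states, dtype=int)
    for _ in range(100):
        # Step #1 - policy evaluation: iterate the fixed-policy Bellman update until (near) convergence
        V = np.zeros(n_states)
        for _ in range(500):
            V = np.array([
                (T[s, policy[s]] * (R[s, policy[s]] + gamma * V)).sum()
                for s in range(n_states)
            ])
        # Step #2 - policy improvement: one-step look-ahead (policy extraction)
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # stop when the policy no longer changes
            break
        policy = new_policy
    ```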

    Properties:

    • It’s still optimal
    • It can converge (much) faster under some conditions

    Comparison - Value Iteration vs Policy Iteration:
    • Both compute the same thing (the optimal values and the optimal policy)
    • In value iteration, every sweep updates both the values and (implicitly) the policy
    • In policy iteration, several cheaper evaluation sweeps (no max over actions) are run for a fixed policy, followed by a policy-improvement step

    Q-Learning | Solving for Optimal Policy:
    A problem with value iteration was scalability: it must compute \(Q(s, a)\) for every state-action pair.
    Q-Learning solves this by using a function approximator to estimate the action-value function:

    $$Q(s, a ; \theta) \approx Q^{* }(s, a)$$

    Deep Q-learning: the case where the function approximator is a deep neural net.

    Training:
    The Q-network is trained by regressing \(Q(s, a ; \theta)\) toward the Bellman target: the loss is \(L_{i}\left(\theta_{i}\right)=\mathbb{E}\left[\left(y_{i}-Q\left(s, a ; \theta_{i}\right)\right)^{2}\right]\) with \(y_{i}=r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta_{i-1}\right)\); the forward pass computes the loss and the backward pass updates \(\theta\).
    img
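
    A minimal PyTorch sketch of such an approximator and of this regression loss follows; the network sizes and the `td_loss` helper are illustrative assumptions, and for simplicity the same network provides the target (no frozen \(\theta_{i-1}\) copy).

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 8, 4, 0.99

    # Q(s, a; theta): a small network mapping a state to one Q-value per action (assumed sizes).
    q_net = nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

    def td_loss(s, a, r, s_next, done):
        """Regress Q(s, a; theta) toward the Bellman target r + gamma * max_a' Q(s', a'; theta)."""
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_i, a_i) for each sample i
        with torch.no_grad():                                  # targets are treated as fixed
            target = r + gamma * q_net(s_next).max(dim=1).values * (1.0 - done)
        return nn.functional.mse_loss(q_sa, target)

    # Example call on a random batch of transitions.
    B = 32
    loss = td_loss(torch.randn(B, state_dim), torch.randint(n_actions, (B,)),
                   torch.randn(B), torch.randn(B, state_dim), torch.zeros(B))
    loss.backward()
    ```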

    Experience Replay:
    Learning from consecutive samples is problematic: the samples are correlated, and the current parameters determine the next samples, which can lead to bad feedback loops. Experience replay addresses this by keeping a replay memory of transitions \(\left(s_{t}, a_{t}, r_{t}, s_{t+1}\right)\) and training on random mini-batches sampled from it.
    img

    Deep Q-learning with Experience Replay - Algorithm:
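
    The algorithm itself was shown as a figure; the following is a rough, hedged sketch of a training loop in that spirit, with a replay memory, an \(\epsilon\)-greedy behaviour policy, and mini-batch TD updates. The `toy_env_step` placeholder, the hyperparameters, and the omission of a separate target network are simplifications of this sketch, not the lecture's exact algorithm.

    ```python
    import random
    from collections import deque

    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)   # replay memory of (s, a, r, s', done) transitions

    def toy_env_step(state, action):
        """Placeholder dynamics standing in for a real environment (an assumption of this sketch)."""
        return torch.randn(state_dim), float(action == 0), random.random() < 0.02

    state = torch.randn(state_dim)
    for step in range(5_000):
        # epsilon-greedy action selection from the current Q-network
        if random.random() < 0.1:
            action = random.randrange(n_actions)
        else:
            with torch.no_grad():
                action = q_net(state).argmax().item()

        next_state, reward, done = toy_env_step(state, action)
        replay.append((state, action, reward, next_state, float(done)))   # store transition
        state = torch.randn(state_dim) if done else next_state            # reset on episode end

        if len(replay) >= 256:
            # sample a random mini-batch of transitions from the replay memory
            batch = random.sample(replay, 32)
            s = torch.stack([b[0] for b in batch])
            a = torch.tensor([b[1] for b in batch])
            r = torch.tensor([b[2] for b in batch])
            s2 = torch.stack([b[3] for b in batch])
            d = torch.tensor([b[4] for b in batch])

            # Q-learning update: regress Q(s, a) toward r + gamma * max_a' Q(s', a')
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = r + gamma * q_net(s2).max(dim=1).values * (1.0 - d)
            loss = nn.functional.mse_loss(q_sa, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    ```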


    Policy Gradients:
    An alternative to learning a Q-function.
    Q-functions can be very complicated.

    Example: a robot grasping an object has a very high-dimensional state => hard to learn exact value of every (state, action) pair.

    - Define a class of parameterized policies:

    $$\Pi=\left\{\pi_{\theta}, \theta \in \mathbb{R}^{m}\right\}$$

    - For each policy, define its value:

    $$J(\theta)=\mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | \pi_{\theta}\right]$$

    - Find the optimal policy \(\theta^{ * }=\arg \max _ {\theta} J(\theta)\) by gradient ascent on policy parameters (REINFORCE Algorithm)
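
    For instance, one member of such a class \(\Pi\) is a small network with a softmax over actions; the sizes and the `sample_action` helper below are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 2

    # pi_theta(a | s): a categorical distribution over actions produced by a small network.
    policy_net = nn.Sequential(
        nn.Linear(state_dim, 32), nn.Tanh(),
        nn.Linear(32, n_actions),
    )

    def sample_action(state):
        dist = torch.distributions.Categorical(logits=policy_net(state))
        action = dist.sample()
        return action.item(), dist.log_prob(action)   # log pi_theta(a_t | s_t), used by REINFORCE
    ```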

    REINFORCE Algorithm:
    Expected Reward:

    $$\begin{aligned} J(\theta) &=\mathbb{E}_{\tau \sim p(\tau ; \theta)}[r(\tau)] \\ &=\int_{\tau} r(\tau) p(\tau ; \theta) \mathrm{d} \tau \end{aligned}$$

    where \(r(\tau)\) is the reward of a trajectory \(\tau=\left(s_{0}, a_{0}, r_{0}, s_{1}, \dots\right)\).
    The Gradient:

    $$\nabla_{\theta} J(\theta)=\int_{\tau} r(\tau) \nabla_{\theta} p(\tau ; \theta) \mathrm{d} \tau$$

    - The gradient is intractable: taking the gradient of an expectation is problematic when \(p\) itself depends on \(\theta\).
    - Solution:

    • Trick:

      $$\nabla_{\theta} p(\tau ; \theta)=p(\tau ; \theta) \frac{\nabla_{\theta} p(\tau ; \theta)}{p(\tau ; \theta)}=p(\tau ; \theta) \nabla_{\theta} \log p(\tau ; \theta)$$

    • Injecting Back:

      $$\begin{aligned} \nabla_{\theta} J(\theta) &=\int_{\tau}\left(r(\tau) \nabla_{\theta} \log p(\tau ; \theta)\right) p(\tau ; \theta) \mathrm{d} \tau \\ &=\mathbb{E}_{\tau \sim p(\tau ; \theta)}\left[r(\tau) \nabla_{\theta} \log p(\tau ; \theta)\right] \end{aligned}$$

    • Estimating the Gradient: Can estimate with Monte Carlo sampling.
      • The gradient does NOT depend on transition probabilities:
        • \[p(\tau ; \theta)=\prod_{t \geq 0} p\left(s_{t+1} | s_{t}, a_{t}\right) \pi_{\theta}\left(a_{t} | s_{t}\right)\]
        • \(\log p(\tau ; \theta)=\sum_{t \geq 0} \log p\left(s_{t+1} | s_{t}, a_{t}\right)+\log \pi_{\theta}\left(a_{t} | s_{t}\right)\)
          \(\implies\)
        • \[\nabla_{\theta} \log p(\tau ; \theta)=\sum_{t \geq 0} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\]
      • Therefore, when sampling a trajectory \(\tau\), we can estimate \(\nabla_{\theta} J(\theta)\) with:

        $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0} r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

    • Gradient Estimator:

      $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0} r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

      • Intuition/Interpretation:
        • If \(r(\tau)\) is high, push up the probabilities of the actions seen
        • If \(r(\tau)\) is low, push down the probabilities of the actions seen. It might seem simplistic to say that if a trajectory is good then all of its actions were good, but in expectation it averages out!
      • Variance:
        • Issue: this estimator suffers from high variance because credit assignment is really hard.
        • Variance Reduction - Two Ideas:
          1. Push up probabilities of an action seen, only by the cumulative future reward from that state:

            $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} r_{t^{\prime}}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

          2. Use discount factor \(\gamma\) to ignore delayed effects

            $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} \gamma^{t^{\prime}-t} r_{t^{\prime}}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

          - Problem: The raw value of a trajectory isn’t necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of all actions.
          - What matters instead is whether a reward is better or worse than what you expect to get.
          - Solution: Introduce a baseline function dependent on the state.
          - Concretely, the estimator is now (a code sketch using this baselined estimator appears after this list):

          $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} \gamma^{t^{\prime}-t} r_{t^{\prime}}-b\left(s_{t}\right)\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

          • Choosing a Baseline:
            • Vanilla REINFORCE:
              A simple baseline: constant moving average of rewards experienced so far from all trajectories.
            • Actor-Critic:
              We want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.
              Intuitively, we are happy with an action \(a_{t}\) in a state \(s_{t}\) if \(Q^{\pi}\left(s_{t}, a_{t}\right)-V^{\pi}\left(s_{t}\right)\) is large. On the contrary, we are unhappy with an action if it’s small.
              Now, the estimator:

              $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(Q^{\pi_{\theta}}\left(s_{t}, a_{t}\right)-V^{\pi_{\theta}}\left(s_{t}\right)\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

              • Learning \(Q\) and \(V\):
                We learn \(Q, V\) using the Actor-Critic Algorithm.
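
    Before moving on to actor-critic, here is a hedged sketch of vanilla REINFORCE with the moving-average baseline described above; the policy network, the `toy_env_step` placeholder dynamics, and the hyperparameters are assumptions of this sketch.

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    policy_net = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)
    baseline = 0.0   # constant moving average of returns (vanilla REINFORCE)

    def toy_env_step(state, action):
        """Placeholder dynamics; a real task would supply these (assumption of this sketch)."""
        return torch.randn(state_dim), float(action == 0), torch.rand(()).item() < 0.05

    for episode in range(500):
        # --- sample one trajectory tau = (s_0, a_0, r_0, s_1, ...) from pi_theta ---
        state, log_probs, rewards, done = torch.randn(state_dim), [], [], False
        while not done:
            dist = torch.distributions.Categorical(logits=policy_net(state))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, done = toy_env_step(state, action.item())
            rewards.append(reward)

        # --- discounted future returns sum_{t' >= t} gamma^{t'-t} r_{t'} for every step t ---
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        baseline = 0.9 * baseline + 0.1 * returns.mean().item()   # moving-average baseline b

        # --- gradient ascent on J(theta): the loss is minus the baselined estimator ---
        loss = -torch.sum(torch.stack(log_probs) * (returns - baseline))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```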

    Actor-Critic Algorithm:
    An algorithm to learn \(Q\) and \(V\).
    We can combine Policy Gradients and Q-learning by training both:

    • Actor: the policy, and
    • Critic: the Q-function

    Details:

    • The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust
    • Also alleviates the task of the critic as it only has to learn the values of (state, action) pairs generated by the policy
    • Can also incorporate Q-learning tricks e.g. experience replay
    • Remark: the advantage function \(A^{\pi}(s, a)=Q^{\pi}(s, a)-V^{\pi}(s)\) quantifies how much better an action was than expected

    Algorithm:
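
    The lecture's algorithm was shown as a figure and is not reproduced here. As a rough, hedged sketch, one common variant uses the one-step TD error \(r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right)\) as an estimate of the advantage; the actor and critic networks, the `toy_env_step` placeholder, and the hyperparameters below are assumptions of this sketch.

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))   # policy pi_theta
    critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))          # value V(s)
    optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

    def toy_env_step(state, action):
        """Placeholder dynamics (an assumption of this sketch)."""
        return torch.randn(state_dim), float(action == 0), torch.rand(()).item() < 0.05

    state = torch.randn(state_dim)
    for step in range(5_000):
        dist = torch.distributions.Categorical(logits=actor(state))   # actor: decide which action to take
        action = dist.sample()
        next_state, reward, done = toy_env_step(state, action.item())

        # critic: one-step estimate of how much better the action was than expected
        v_s = critic(state).squeeze()
        with torch.no_grad():
            v_next = torch.zeros(()) if done else critic(next_state).squeeze()
        advantage = reward + gamma * v_next - v_s.detach()

        actor_loss = -dist.log_prob(action) * advantage     # push up actions better than expected
        critic_loss = (reward + gamma * v_next - v_s) ** 2  # move V(s) toward the TD target
        optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        optimizer.step()

        state = torch.randn(state_dim) if done else next_state
    ```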


    Summary:

    • Policy gradients: very general but suffer from high variance so requires a lot of samples. Challenge: sample-efficiency
    • Q-learning: does not always work but when it works, usually more sample-efficient. Challenge: exploration

    • Guarantees:
      • Policy Gradients: Converges to a local maximum of \(J(\theta)\), often good enough!
      • Q-learning: Zero guarantees, since you are approximating the Bellman equation with a complicated function approximator