Table of Contents



Deep RL WS1
Deep RL WS2
Deep RL Lec CS294 Berk
Reinforcement Learning Course Lectures UCL
RL CS188
Deep RL (CS231n Lecture)
Deep RL (CS231n Slides)

Intro - Reinforcement Learning

  1. Reinforcement Learning:

    img

  2. Asynchronous:

  3. Mathematical Formulation of RL - Markov Decision Processes:
    Markov Decision Process

    Defined by \((\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)\):

    • \(\mathcal{S}\): set of possible states
    • \(\mathcal{A}\): set of possible actions
    • \(\mathcal{R}\): distribution of reward given (state, action) pair
    • \(\mathbb{P}\): transition probability i.e. distribution over next state given (state, action) pair
    • \(\gamma\): discount factor

    MDPs Algorithm/Idea:

    • At time step \(t=0\), environment samples initial state \(s_{0} \sim p\left(s_{0}\right)\)
    • Then, for \(t=0\) until done:
      • Agent selects action \(a_{t}\)
      • Environment samples reward \(r_{t} \sim R\left(\cdot | s_{t}, a_{t}\right)\)
      • Environment samples next state \(s_{t+1} \sim P\left(\cdot | s_{t}, a_{t}\right)\)
      • Agent receives reward \(r_{t}\) and next state \(s_{t+1}\)

    - A policy \(\pi\) is a function from \(S\) to \(A\) that specifies what action to take in each state
    - Objective: find policy \(\pi^{\ast}\) that maximizes cumulative discounted reward:

    $$\sum_{t \geq 0} \gamma^{t} r_{t}$$
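
    To make the loop above concrete, here is a minimal Python sketch of the agent-environment interaction and of the discounted return \(\sum_{t \geq 0} \gamma^{t} r_{t}\). The `ToyMDP` dynamics and the random behaviour policy are placeholder assumptions for illustration, not part of the lecture.

    ```python
    import random

    class ToyMDP:
        """A tiny illustrative MDP (states and dynamics are assumptions of this sketch)."""
        def __init__(self, n_states=4, n_actions=2, gamma=0.9):
            self.n_states, self.n_actions, self.gamma = n_states, n_actions, gamma

        def reset(self):
            # environment samples an initial state s_0 ~ p(s_0)
            return random.randrange(self.n_states)

        def step(self, state, action):
            # environment samples reward r_t ~ R(. | s_t, a_t) and next state s_{t+1} ~ P(. | s_t, a_t)
            reward = random.gauss(float(action == state % self.n_actions), 0.1)
            next_state = random.randrange(self.n_states)
            done = random.random() < 0.05
            return next_state, reward, done

    env = ToyMDP()
    state = env.reset()
    discounted_return, t, done = 0.0, 0, False
    while not done:
        action = random.randrange(env.n_actions)        # agent selects a_t (here: a random policy)
        state, reward, done = env.step(state, action)   # agent receives r_t and s_{t+1}
        discounted_return += env.gamma ** t * reward    # accumulate sum_t gamma^t r_t
        t += 1
    print(discounted_return)
    ```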

    Optimal Policy \(\pi^{\ast}\):
    We want to find the optimal policy \(\pi^{\ast}\) that maximizes the sum of rewards.
    We handle the randomness (initial state, transition probabilities, …) by maximizing the expected sum of rewards.
    Formally,

    $$\pi^{\ast}=\arg \max _{\pi} \mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | \pi\right] \quad \text { with } \quad s_{0} \sim p\left(s_{0}\right), a_{t} \sim \pi\left(\cdot | s_{t}\right), s_{t+1} \sim p\left(\cdot | s_{t}, a_{t}\right)$$

    The Bellman Equations:
    Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values.
    The Bellman Equations characterize optimal values:

    $$\begin{aligned} V^{\ast}(s) &= \max _{a} Q^{\ast}(s, a) \\ Q^{\ast}(s, a) &= \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\ast}\left(s^{\prime}\right)\right] \\ V^{\ast}(s) &= \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\ast}\left(s^{\prime}\right)\right] \end{aligned}$$

    Value Iteration Algorithm:
    The Value Iteration algorithm computes the optimal values:

    $$V_{k+1}(s) \leftarrow \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{k}\left(s^{\prime}\right)\right]$$

    - Value iteration is just a fixed-point solution method.
    - It repeatedly applies the Bellman update until the values converge.
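
    As a concrete illustration, the sketch below runs this update on a small tabular MDP until the values stop changing; the transition tensor `T[s, a, s']` and reward table `R[s, a, s']` are randomly generated placeholders, not from the lecture.

    ```python
    import numpy as np

    n_states, n_actions, gamma = 5, 2, 0.9
    rng = np.random.default_rng(0)

    # Assumed toy dynamics: T[s, a, s'] is a transition distribution, R[s, a, s'] a reward table.
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions, n_states))

    V = np.zeros(n_states)
    for _ in range(1000):
        # V_{k+1}(s) = max_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V_k(s')]
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)   # Q_k(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:                   # fixed point reached
            break
        V = V_new

    optimal_policy = Q.argmax(axis=1)   # greedy policy extracted from the converged values
    ```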

    Convergence:
    Value iteration converges: the Bellman update is a \(\gamma\)-contraction, so the values \(V_{k}\) approach the unique optimal values \(V^{\ast}\) (for \(\gamma<1\)).

    Issues:

    • Problem 1: It’s slow – \(O(S^2A)\) per iteration
    • Problem 2: The “max” at each state rarely changes
    • Problem 3: The policy often converges long before the values
    • Problem 4: Not scalable. Must compute \(Q(s, a)\) for every state-action pair. If the state is, e.g., the raw pixels of the current game screen, it is computationally infeasible to compute values for the entire state space

    Policy Iteration:
    It is an alternative approach for computing optimal values:
    Policy Iteration algorithm:

    • Step #1 - Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
    • Step #2 - Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
    • Repeat steps until the policy converges

    • Evaluation:
      For fixed current policy \(\pi\), find values with policy evaluation:
      • Iterate until values converge:

        $$V_{k+1}^{\pi_{i}}(s) \leftarrow \sum_{s^{\prime}} T\left(s, \pi_{i}(s), s^{\prime}\right)\left[R\left(s, \pi_{i}(s), s^{\prime}\right)+\gamma V_{k}^{\pi_{i}}\left(s^{\prime}\right)\right]$$

    • Improvement:
      For fixed values, get a better policy using policy extraction:
      • One-step look-ahead:

        $$\pi_{i+1}(s)=\arg \max _{a} \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\pi_{i}}\left(s^{\prime}\right)\right]$$
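
    A compact sketch of these two steps on the same kind of tabular toy MDP (again, `T` and `R` are assumed placeholder arrays). Note that each evaluation sweep has no max over actions, which is what can make policy iteration cheaper per sweep than value iteration.

    ```python
    import numpy as np

    n_states, n_actions, gamma = 5, 2, 0.9
    rng = np.random.default_rng(1)
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions, n_states))

    policy = np.zeros(n_states, dtype=int)
    for _ in range(100):
        # Step #1 - policy evaluation: iterate the fixed-policy Bellman update until (near) convergence
        V = np.zeros(n_states)
        for _ in range(500):
            V = np.array([
                (T[s, policy[s]] * (R[s, policy[s]] + gamma * V)).sum()
                for s in range(n_states)
            ])
        # Step #2 - policy improvement: one-step look-ahead (policy extraction)
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # stop when the policy no longer changes
            break
        policy = new_policy
    ```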

    Properties:

    • It’s still optimal
    • It can converge (much) faster under some conditions

    Comparison - Value Iteration vs Policy Iteration:
    • Both compute the same thing (the optimal values and the optimal policy)
    • In value iteration, every sweep updates both the values and (implicitly) the policy
    • In policy iteration, several cheaper evaluation sweeps (no max over actions) are run for a fixed policy, followed by a policy-improvement step

    Q-Learning | Solving for Optimal Policy:
    A problem with value iteration was scalability: it must compute \(Q(s, a)\) for every state-action pair.
    Q-Learning solves this by using a function approximator to estimate the action-value function:

    $$Q(s, a ; \theta) \approx Q^{* }(s, a)$$

    Deep Q-learning: the case where the function approximator is a deep neural net.

    Training:
    The Q-network is trained by regressing \(Q(s, a ; \theta)\) toward the Bellman target: the loss is \(L_{i}\left(\theta_{i}\right)=\mathbb{E}\left[\left(y_{i}-Q\left(s, a ; \theta_{i}\right)\right)^{2}\right]\) with \(y_{i}=r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime} ; \theta_{i-1}\right)\); the forward pass computes the loss and the backward pass updates \(\theta\).
    img
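
    A minimal PyTorch sketch of such an approximator and of this regression loss follows; the network sizes and the `td_loss` helper are illustrative assumptions, and for simplicity the same network provides the target (no frozen \(\theta_{i-1}\) copy).

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 8, 4, 0.99

    # Q(s, a; theta): a small network mapping a state to one Q-value per action (assumed sizes).
    q_net = nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

    def td_loss(s, a, r, s_next, done):
        """Regress Q(s, a; theta) toward the Bellman target r + gamma * max_a' Q(s', a'; theta)."""
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_i, a_i) for each sample i
        with torch.no_grad():                                  # targets are treated as fixed
            target = r + gamma * q_net(s_next).max(dim=1).values * (1.0 - done)
        return nn.functional.mse_loss(q_sa, target)

    # Example call on a random batch of transitions.
    B = 32
    loss = td_loss(torch.randn(B, state_dim), torch.randint(n_actions, (B,)),
                   torch.randn(B), torch.randn(B, state_dim), torch.zeros(B))
    loss.backward()
    ```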

    Experience Replay:
    Learning from consecutive samples is problematic: the samples are correlated, and the current parameters determine the next samples, which can lead to bad feedback loops. Experience replay addresses this by keeping a replay memory of transitions \(\left(s_{t}, a_{t}, r_{t}, s_{t+1}\right)\) and training on random mini-batches sampled from it.
    img

    Deep Q-learning with Experience Replay - Algorithm:
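
    The algorithm itself was shown as a figure; the following is a rough, hedged sketch of a training loop in that spirit, with a replay memory, an \(\epsilon\)-greedy behaviour policy, and mini-batch TD updates. The `toy_env_step` placeholder, the hyperparameters, and the omission of a separate target network are simplifications of this sketch, not the lecture's exact algorithm.

    ```python
    import random
    from collections import deque

    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)   # replay memory of (s, a, r, s', done) transitions

    def toy_env_step(state, action):
        """Placeholder dynamics standing in for a real environment (an assumption of this sketch)."""
        return torch.randn(state_dim), float(action == 0), random.random() < 0.02

    state = torch.randn(state_dim)
    for step in range(5_000):
        # epsilon-greedy action selection from the current Q-network
        if random.random() < 0.1:
            action = random.randrange(n_actions)
        else:
            with torch.no_grad():
                action = q_net(state).argmax().item()

        next_state, reward, done = toy_env_step(state, action)
        replay.append((state, action, reward, next_state, float(done)))   # store transition
        state = torch.randn(state_dim) if done else next_state            # reset on episode end

        if len(replay) >= 256:
            # sample a random mini-batch of transitions from the replay memory
            batch = random.sample(replay, 32)
            s = torch.stack([b[0] for b in batch])
            a = torch.tensor([b[1] for b in batch])
            r = torch.tensor([b[2] for b in batch])
            s2 = torch.stack([b[3] for b in batch])
            d = torch.tensor([b[4] for b in batch])

            # Q-learning update: regress Q(s, a) toward r + gamma * max_a' Q(s', a')
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = r + gamma * q_net(s2).max(dim=1).values * (1.0 - d)
            loss = nn.functional.mse_loss(q_sa, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    ```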


    Policy Gradients:
    An alternative to learning a Q-function.
    Q-functions can be very complicated.

    Example: a robot grasping an object has a very high-dimensional state => hard to learn exact value of every (state, action) pair.

    - Define a class of parameterized policies:

    $$\Pi=\left\{\pi_{\theta}, \theta \in \mathbb{R}^{m}\right\}$$

    - For each policy, define its value:

    $$J(\theta)=\mathbb{E}\left[\sum_{t \geq 0} \gamma^{t} r_{t} | \pi_{\theta}\right]$$

    - Find the optimal policy \(\theta^{ * }=\arg \max _ {\theta} J(\theta)\) by gradient ascent on policy parameters (REINFORCE Algorithm)
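
    For instance, one member of such a class \(\Pi\) is a small network with a softmax over actions; the sizes and the `sample_action` helper below are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 2

    # pi_theta(a | s): a categorical distribution over actions produced by a small network.
    policy_net = nn.Sequential(
        nn.Linear(state_dim, 32), nn.Tanh(),
        nn.Linear(32, n_actions),
    )

    def sample_action(state):
        dist = torch.distributions.Categorical(logits=policy_net(state))
        action = dist.sample()
        return action.item(), dist.log_prob(action)   # log pi_theta(a_t | s_t), used by REINFORCE
    ```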

    REINFORCE Algorithm:
    Expected Reward:

    $$\begin{aligned} J(\theta) &=\mathbb{E}_{\tau \sim p(\tau ; \theta)}[r(\tau)] \\ &=\int_{\tau} r(\tau) p(\tau ; \theta) \mathrm{d} \tau \end{aligned}$$

    where \(r(\tau)\) is the reward of a trajectory \(\tau=\left(s_{0}, a_{0}, r_{0}, s_{1}, \dots\right)\).
    The Gradient:

    $$\nabla_{\theta} J(\theta)=\int_{\tau} r(\tau) \nabla_{\theta} p(\tau ; \theta) \mathrm{d} \tau$$

    - The gradient is intractable: taking the gradient of an expectation is problematic when \(p\) itself depends on \(\theta\).
    - Solution:

    • Trick:

      $$\nabla_{\theta} p(\tau ; \theta)=p(\tau ; \theta) \frac{\nabla_{\theta} p(\tau ; \theta)}{p(\tau ; \theta)}=p(\tau ; \theta) \nabla_{\theta} \log p(\tau ; \theta)$$

    • Injecting Back:

      $$\begin{aligned} \nabla_{\theta} J(\theta) &=\int_{\tau}\left(r(\tau) \nabla_{\theta} \log p(\tau ; \theta)\right) p(\tau ; \theta) \mathrm{d} \tau \\ &=\mathbb{E}_{\tau \sim p(\tau ; \theta)}\left[r(\tau) \nabla_{\theta} \log p(\tau ; \theta)\right] \end{aligned}$$

    • Estimating the Gradient: Can estimate with Monte Carlo sampling.
      • The gradient does NOT depend on transition probabilities:
        • \[p(\tau ; \theta)=\prod_{t \geq 0} p\left(s_{t+1} | s_{t}, a_{t}\right) \pi_{\theta}\left(a_{t} | s_{t}\right)\]
        • \(\log p(\tau ; \theta)=\sum_{t \geq 0} \log p\left(s_{t+1} | s_{t}, a_{t}\right)+\log \pi_{\theta}\left(a_{t} | s_{t}\right)\)
          \(\implies\)
        • \[\nabla_{\theta} \log p(\tau ; \theta)=\sum_{t \geq 0} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\]
      • Therefore, when sampling a trajectory \(\tau\), we can estimate \(\nabla_{\theta} J(\theta)\) with:

        $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0} r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

    • Gradient Estimator:

      $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0} r(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

      • Intuition/Interpretation:
        • If \(r(\tau)\) is high, push up the probabilities of the actions seen
        • If \(r(\tau)\) is low, push down the probabilities of the actions seen. It might seem simplistic to say that if a trajectory is good then all of its actions were good, but in expectation it averages out!
      • Variance:
        • Issue: this estimator suffers from high variance because credit assignment is really hard.
        • Variance Reduction - Two Ideas:
          1. Push up probabilities of an action seen, only by the cumulative future reward from that state:

            $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} r_{t^{\prime}}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

          2. Use discount factor \(\gamma\) to ignore delayed effects

            $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} \gamma^{t^{\prime}-t} r_{t^{\prime}}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

          - Problem: The raw value of a trajectory isn’t necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of all actions.
          - What matters instead is whether a reward is better or worse than what you expect to get.
          - Solution: Introduce a baseline function dependent on the state.
          - Concretely, the estimator is now (a code sketch using this baselined estimator appears after this list):

          $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(\sum_{t^{\prime} \geq t} \gamma^{t^{\prime}-t} r_{t^{\prime}}-b\left(s_{t}\right)\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

          • Choosing a Baseline:
            • Vanilla REINFORCE:
              A simple baseline: constant moving average of rewards experienced so far from all trajectories.
            • Actor-Critic:
              We want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.
              Intuitively, we are happy with an action \(a_{t}\) in a state \(s_{t}\) if \(Q^{\pi}\left(s_{t}, a_{t}\right)-V^{\pi}\left(s_{t}\right)\) is large. On the contrary, we are unhappy with an action if it’s small.
              Now, the estimator:

              $$\nabla_{\theta} J(\theta) \approx \sum_{t \geq 0}\left(Q^{\pi_{\theta}}\left(s_{t}, a_{t}\right)-V^{\pi_{\theta}}\left(s_{t}\right)\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)$$

              • Learning \(Q\) and \(V\):
                We learn \(Q, V\) using the Actor-Critic Algorithm.
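
    Before moving on to actor-critic, here is a hedged sketch of vanilla REINFORCE with the moving-average baseline described above; the policy network, the `toy_env_step` placeholder dynamics, and the hyperparameters are assumptions of this sketch.

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    policy_net = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)
    baseline = 0.0   # constant moving average of returns (vanilla REINFORCE)

    def toy_env_step(state, action):
        """Placeholder dynamics; a real task would supply these (assumption of this sketch)."""
        return torch.randn(state_dim), float(action == 0), torch.rand(()).item() < 0.05

    for episode in range(500):
        # --- sample one trajectory tau = (s_0, a_0, r_0, s_1, ...) from pi_theta ---
        state, log_probs, rewards, done = torch.randn(state_dim), [], [], False
        while not done:
            dist = torch.distributions.Categorical(logits=policy_net(state))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, done = toy_env_step(state, action.item())
            rewards.append(reward)

        # --- discounted future returns sum_{t' >= t} gamma^{t'-t} r_{t'} for every step t ---
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        baseline = 0.9 * baseline + 0.1 * returns.mean().item()   # moving-average baseline b

        # --- gradient ascent on J(theta): the loss is minus the baselined estimator ---
        loss = -torch.sum(torch.stack(log_probs) * (returns - baseline))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```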

    Actor-Critic Algorithm:
    An algorithm to learn \(Q\) and \(V\).
    We can combine Policy Gradients and Q-learning by training both:

    • Actor: the policy, and
    • Critic: the Q-function

    Details:

    • The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust
    • Also alleviates the task of the critic as it only has to learn the values of (state, action) pairs generated by the policy
    • Can also incorporate Q-learning tricks e.g. experience replay
    • Remark: the advantage function \(A^{\pi}(s, a)=Q^{\pi}(s, a)-V^{\pi}(s)\) quantifies how much better an action was than expected

    Algorithm:
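
    The lecture's algorithm was shown as a figure and is not reproduced here. As a rough, hedged sketch, one common variant uses the one-step TD error \(r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right)\) as an estimate of the advantage; the actor and critic networks, the `toy_env_step` placeholder, and the hyperparameters below are assumptions of this sketch.

    ```python
    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))   # policy pi_theta
    critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))          # value V(s)
    optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

    def toy_env_step(state, action):
        """Placeholder dynamics (an assumption of this sketch)."""
        return torch.randn(state_dim), float(action == 0), torch.rand(()).item() < 0.05

    state = torch.randn(state_dim)
    for step in range(5_000):
        dist = torch.distributions.Categorical(logits=actor(state))   # actor: decide which action to take
        action = dist.sample()
        next_state, reward, done = toy_env_step(state, action.item())

        # critic: one-step estimate of how much better the action was than expected
        v_s = critic(state).squeeze()
        with torch.no_grad():
            v_next = torch.zeros(()) if done else critic(next_state).squeeze()
        advantage = reward + gamma * v_next - v_s.detach()

        actor_loss = -dist.log_prob(action) * advantage     # push up actions better than expected
        critic_loss = (reward + gamma * v_next - v_s) ** 2  # move V(s) toward the TD target
        optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        optimizer.step()

        state = torch.randn(state_dim) if done else next_state
    ```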


    Summary:

    • Policy gradients: very general but suffer from high variance so requires a lot of samples. Challenge: sample-efficiency
    • Q-learning: does not always work but when it works, usually more sample-efficient. Challenge: exploration

    • Guarantees:
      • Policy Gradients: Converges to a local maximum of \(J(\theta)\), often good enough!
      • Q-learning: Zero guarantees, since you are approximating the Bellman equation with a complicated function approximator