Wednesday, May 15, 2019

RL summary - Part I

While I have been working in my first PhD project on Bayesian optimization, I want to start to think about an idea for the second project even the first project has not shown any significance yet. But well, it is better to start on multiple ideas and switch between them. For the second PhD project, I want to extend the sequential decision making setting to a more complex setting: the standard Reinforcement Learning (RL). Note that even Bayesian optimization is a less complex setting of sequential decision making, it does not necessarily mean Bayesian optimization is easier than RL. Each has their own challenges.

I have been on and off in studying and updating about RL but this time I want to put it more comprehensively by forcing myself to summarise core concepts in RL. That is the purpose of this blog post: Summarising the key concepts in RL. To start off, I summarize material in [1] here.

  • Deep RL: uses neural networks to represent or approximate state representation, value function, policy, transition function or the reward function
  • Prediction in RL: basically it is a policy evaluation, i.e., given a policy, we wish to compute the state value function or the action-state value function. 
  • Control in RL: This is about finding the optimal policy to act. 
  • Planning in RL: how you plan to act is described by a model for policy. 
  • On-policy methods: simply evaluate and improve the current policy. For example, SARSA
  • Off-policy methods: Learn an optimal policy by using external knowledge such as some external behavior policy or samples other than the samples generated. For example, Q-learning
  • Exploration-exploitation dilemma: The agent needs to exploit the previously acquired best actions to maximize the rewards but at the same time might need to explore new actions to learn about the environment. 
  • Model-free methods: The agent directly learns from experience via trial and error, and the model for the state transition and reward are not known. 
  • Model-based methods: The model for the state transition and/or reward can be given or learned from experience. 
  • Online training: trained on the data acquired in sequence. 
  • Offline training: trained on the entire dataset. 
  • Bootstrapping for estimating the state value and action value: Update an estimate of the state value and action value using subsequent estimates (?). 
  • Markov Decision Process (MDP): This is a basic (and nice) environment for RL, defined by the 5-tuple $(\mathcal{S},\mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$. Given the current state, the future situation depends only on the current state and not on the past states. With the system model (the transition probability $\mathcal{P}$ and the reward model $\mathcal{R}$) are known, this can be solved by dynamic programming (DP). 
  • Partially Observable Markov Decision Process (POMDP): In this RL environment, the agent cannot directly observe the state but rather only some observations of the state. 
  • MDP and POMDP are just two particular (and nice) RL environments. Other RL environments can be multi-armed bandit and real games. 

State value and action-state value

A state value function (at a state with a policy) is the expected accumulative discounted reward if the agent is in a some state and following some policy thereafter:

$$V_{\pi}(s) = \mathbb{E}[R_t | S_t = s]$$
where $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. 

A state-action value function is the state value function at a particular action: 

$$Q_{\pi}(s,a) = \mathbb{R}[R_t | S_t = s, A_t = a]. $$

With those definitions, we have the Bellman equation:

$$V_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V_{\pi}(s')]$$ 

$$V_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma V_*(s')] $$

$$Q_{\pi}(s,a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a' | s') Q_{\pi}(s', a')]$$

$$Q_*(s,a) = \sum_{s', r} p(s', r | s,a)[r + \gamma \max_{a'} Q_{*}(s', a')]$$

As an example, think of the shortest-path problem. With the global information about the graph, Dijkstra's algorithm is already very efficient. Without such global information, RL can learn in a model-free mode by trying different paths and learn from this experience. 

Note that the state here 

Exploration vs exploitation 

  • $\epsilon$-greedy: Select $a = arg\max_{a}Q(s,a)$ with probability $1 - \epsilon$, and select a random action with probability $\epsilon$. 
  • Other methods for balancing exploration and exploitation: optimistic initialization, upper confidence bound, probability matching, information state search

Dynamic Programming 

DP assumes the full knowledge of the transition and reward models of the MDP. 
  • Policy evaluation: Iteratively estimate the state value function of a policy using the Bellman equation. This converges to the state value function of the considered policy. 
  • Policy improvement: Given the state value of a policy, select actions greedily according to this state value function. 
  • Policy iteration (PI): The alternative between the iterative policy evaluation and the greedy policy improvement converges to an optimal policy and its value function. 
  • Generalized policy iteration (GPI): alternate between some steps of policy evaluation and some steps of policy improvement. 
  • Value iteration (VI): Directly iteratively update the Bellman equation for the optimal state value function. This update converges to the value function of the optimal policy. 
Bellman operator: Think of the Bellman equation as a (functional) operator (so called the Bellman operator) $T^{\pi}$ applied on a function $v$, 

$$(T^{\pi}v)(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s',r|s,a) (r + \gamma v_{\pi}(s') )$$
then true value function of the current policy is a fixed point of this operator: 
$$T^{\pi}V_{\pi} = V_{\pi}.$$
Similarly, we can define the Bellman optimality operator for the optimal value function. 

The existence of the fixed points (thus, the existence of the value function and the optimal value function) for these operators are due to the Banach's fixed point theorem [2], which basically says that every contraction (a mapping that moves points closer together, i.e., $d(f(x), f(y)) \leq \alpha d(x,y)$) on a complete metric space has a fixed point. 

Monte Carlo 

  • Learn from complete episodes (thus applicable only to episodic tasks)
  • No assumption about the transition nor reward models 
  • Maintain a MC estimate of the value function of a given policy (for policy evaluation). 
  • On-policy Monte Carlo control: a generalized policy iteration which alternates between Monte Carlo policy evaluation and $\epsilon$-greedy policy improvement. This control converges to the optimal action-value function. 
  • Off-policy learning: while learning for an optimal policy, the agent can follow some external sources such as external exploratory (or behaviour) policy, observations from human and other agents, experience for old policies or learn multiple policies. 

Temporal difference learning 

Temporal difference (TD) learning is central in DL. It directly learns the value function $V(s)$ from bootstrapping in a model-free, online and incremental manner:

$$V(s) \leftarrow V(s) + \alpha [ r + \gamma V(s') - V(s) ]$$

TD learning uses bootstrapping technique because it estimates the value function based on the current estimate, i.e., it does not wait for the state function to be estimated on all states, but it uses whatever the state function estimate is to update the state function at any upcoming state. This is a in-place update, similar idea of bootstrapping because it is a kind of sampling with replacement.

Similarly, Q-learning and SARSA also use bootstrapping techniques.

Q-learning: $$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s,a)]$$

SARSA: $$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s', a') - Q(s,a)]$$


References 


No comments:

Post a Comment