Thursday, May 23, 2019

What are the key factors for me to overcome my limit?


  1. Experiments are extremely important to prove your points; otherwise your theorems are meaningless! Thus, the ability to design and run the right experiments is crucial. 
  2. Collaborations: Do not ever think that I can do research all alone (at least for now). Accept the fact that I am not ready yet for that level of independence and need a lot of input, discussion, and encouragement from advisers, colleagues, collaborators and friends. 
  3. Solving important problems: I must be inspired by important real-life problems, even if what I am doing is somewhat theoretical. I think this factor is very crucial! In that sense, should I be motivated by the problem first, or motivated by an approach and then find relevant problems later?
  4. Exceptional concentration on finishing the work? 

Summary on "Probabilistic Planning with Sequential Monte Carlo Methods" (ICLR2019)

  • probabilistic inference over optimal trajectories 
  • planning in continuous domains
  • fixed computational budget
  • An agent should be able to predict the consequences of its actions and explain how the environment will react. 
  • Use a model to plan future actions. Some examples of such planning methods are Monte Carlo Tree Search (MCTS), the cross-entropy method (CEM) and the iterative linear quadratic regulator (iLQR).
  • MCTS: discrete games and known dynamics
  • CEM: assumes the distribution over future optimal trajectories to be Gaussian (a minimal sketch is given after this list). 
  • iLQR: assumes the dynamics are locally linear-Gaussian. 
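
To make the CEM bullet concrete, here is a minimal sketch of CEM-based planning (this is generic CEM, not the paper's sequential Monte Carlo planner). The function `simulate(state, action_sequence)` is a hypothetical stand-in for rolling a candidate action sequence out under a known or learned model and returning its total reward; all names and hyperparameters here are my own choices.

```python
import numpy as np

def cem_plan(state, simulate, horizon=10, action_dim=2,
             n_samples=100, n_elite=10, n_iters=5):
    """Cross-entropy method planner: maintain a Gaussian over action
    sequences and repeatedly refit it to the highest-return samples."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = mean + std * np.random.randn(n_samples, horizon, action_dim)
        # Score each candidate with the (assumed) model-based simulator.
        returns = np.array([simulate(state, seq) for seq in candidates])
        # Refit the Gaussian to the elite (top-return) candidates.
        elite = candidates[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan
```

The Gaussian assumption mentioned in the CEM bullet is exactly the `mean`/`std` parameterisation here: the planner cannot represent multi-modal distributions over future trajectories.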

Thursday, May 16, 2019

RL summary - part 2


Dynamic Programming vs Temporal Difference vs Monte Carlo

  • DP is full backup:
    • Policy evaluation: $$v(s) \leftarrow \mathbb{E}[r + \gamma v(s') | s] $$
    • Q policy evaluation: $$Q(s,a) \leftarrow \mathbb{E}[r + \gamma Q(s', a') | s, a]$$ 
    • Q value iteration: $$Q(s,a) \leftarrow \mathbb{E}[r + \gamma \max_{a'} Q(s', a') | s,a]$$
  • TD is sample backup:
    • TD learning: $$v(s) \leftarrow v(s) + \alpha [r + \gamma v(s') - v(s)] $$
    • SARSA: $$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma Q(s', a') - Q(s,a)] $$
    • Q-learning: $$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s,a)]$$
  • MC is also a sample backup: estimate $v(s)$ by averaging the complete returns observed after visiting state $s$,
$$v(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s),$$
where $G_i(s)$ is the discounted return following the $i$-th visit to $s$ and $N(s)$ is the number of such visits.

  • DP: requires a model (the state transition distribution and the reward distribution), bootstraps, and uses full backups (one step). 
  • TD: model-free, bootstraps, and uses sample backups (one or multiple steps). It is biased but has low variance, is sensitive to initialization, is effective in Markov environments because it exploits the Markov structure in its updates, can learn online, can learn from incomplete sequences and can learn in non-episodic environments. 
  • MC: model-free, no bootstrapping, sample backups. It has zero bias but high variance, is insensitive to initialization, has good convergence properties, is more effective in non-Markov environments, and can only learn from complete episodes in episodic environments. (A small sketch contrasting TD(0) and first-visit MC prediction follows this list.)
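
To make the sample-backup distinction concrete, here is a minimal sketch of tabular TD(0) and first-visit MC policy evaluation. The `env` object is a hypothetical episodic environment whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward, done)`; this interface and all names are assumptions, not from any particular library.

```python
import numpy as np
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.99, alpha=0.1, n_episodes=1000):
    """TD(0): one-step sample backup that bootstraps from V[s']."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # sample backup, bootstrapped
            s = s_next
    return V

def mc_prediction(env, policy, gamma=0.99, n_episodes=1000):
    """First-visit MC: average complete returns; no bootstrapping."""
    returns, V = defaultdict(list), defaultdict(float)
    for _ in range(n_episodes):
        episode, s, done = [], env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            episode.append((s, r))
            s = s_next
        # Discounted return G_t for every time step, computed backwards.
        G, G_at = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        # First-visit: only the first occurrence of each state counts.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(G_at[t])
                V[s] = np.mean(returns[s])
    return V
```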

Wednesday, May 15, 2019

RL summary - Part I

While I have been working on my first PhD project on Bayesian optimization, I want to start thinking about an idea for the second project, even though the first project has not shown any significant results yet. But well, it is better to work on multiple ideas and switch between them. For the second PhD project, I want to extend the sequential decision-making setting to a more complex one: standard Reinforcement Learning (RL). Note that even though Bayesian optimization is a less complex sequential decision-making setting, that does not necessarily mean Bayesian optimization is easier than RL. Each has its own challenges.

I have been studying RL on and off, but this time I want to be more comprehensive by forcing myself to summarise its core concepts. That is the purpose of this blog post: summarising the key concepts in RL. To start off, I summarize the material in [1] here.

  • Deep RL: uses neural networks to represent or approximate state representation, value function, policy, transition function or the reward function
  • Prediction in RL: this is basically policy evaluation, i.e., given a policy, we wish to compute the state value function or the action-state value function. 
  • Control in RL: This is about finding the optimal policy to act. 
  • Planning in RL: using a model of the environment to compute or improve a policy, rather than relying only on direct experience. 
  • On-policy methods: evaluate and improve the same policy that is being used to generate the data. For example, SARSA.
  • Off-policy methods: learn about a target (e.g., optimal) policy while following a different behaviour policy, or from samples other than those the target policy would generate. For example, Q-learning.
  • Exploration-exploitation dilemma: The agent needs to exploit the previously acquired best actions to maximize the rewards but at the same time might need to explore new actions to learn about the environment. 
  • Model-free methods: The agent learns directly from experience via trial and error; the models for the state transition and the reward are not known. 
  • Model-based methods: The model for the state transition and/or reward can be given or learned from experience. 
  • Online training: trained on the data acquired in sequence. 
  • Offline training: trained on the entire dataset. 
  • Bootstrapping for estimating the state value and action value: updating an estimate of the state value or action value using current estimates at successor states, rather than waiting for the final outcome. 
  • Markov Decision Process (MDP): This is a basic (and nice) environment for RL, defined by the 5-tuple $(\mathcal{S},\mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$. Given the current state, the future depends only on the current state and not on the past states. With the system model (the transition probability $\mathcal{P}$ and the reward model $\mathcal{R}$) known, this can be solved by dynamic programming (DP). (A concrete toy instance of the 5-tuple is sketched after this list.)
  • Partially Observable Markov Decision Process (POMDP): In this RL environment, the agent cannot directly observe the state but rather only some observations of the state. 
  • MDP and POMDP are just two particular (and nice) RL environments. Other RL environments include multi-armed bandits and real games. 
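
As a concrete toy instance of the MDP 5-tuple above, here is a minimal sketch with two states and two actions; the transition probabilities and rewards are made up purely for illustration.

```python
import numpy as np

# A made-up MDP (S, A, P, R, gamma) with |S| = 2 and |A| = 2.
gamma = 0.9

# P[s, a, s'] = probability of moving from state s to s' when taking action a.
P = np.array([
    [[0.9, 0.1],   # state 0, action 0
     [0.2, 0.8]],  # state 0, action 1
    [[0.0, 1.0],   # state 1, action 0
     [0.5, 0.5]],  # state 1, action 1
])

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

assert np.allclose(P.sum(axis=-1), 1.0)  # each (s, a) row is a distribution
```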

State value and action-state value

A state value function (at a state, under a policy) is the expected accumulated discounted reward if the agent starts in that state and follows the policy thereafter:

$$V_{\pi}(s) = \mathbb{E}[R_t | S_t = s]$$
where $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. 

A state-action value function is the analogous quantity when the first action is also fixed: 

$$Q_{\pi}(s,a) = \mathbb{E}[R_t | S_t = s, A_t = a]. $$

With those definitions, we have the Bellman equation:

$$V_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V_{\pi}(s')]$$ 

$$V_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma V_*(s')] $$

$$Q_{\pi}(s,a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a' | s') Q_{\pi}(s', a')]$$

$$Q_*(s,a) = \sum_{s', r} p(s', r | s,a)[r + \gamma \max_{a'} Q_{*}(s', a')]$$
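
For a finite MDP and a fixed policy, the first Bellman equation above is linear in $V_{\pi}$, so it can be solved exactly as $V_{\pi} = (I - \gamma P_{\pi})^{-1} r_{\pi}$, where $P_{\pi}$ and $r_{\pi}$ are the state-transition matrix and expected rewards induced by $\pi$. A minimal numerical sketch with made-up numbers:

```python
import numpy as np

gamma = 0.9
# Made-up transition matrix P_pi[s, s'] and expected rewards r_pi[s]
# induced by some fixed policy pi on a 2-state MDP.
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, 0.0])

# Solve V_pi = r_pi + gamma * P_pi @ V_pi exactly.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: the Bellman expectation equation holds at the solution.
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
print(V_pi)
```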

As an example, think of the shortest-path problem. With global information about the graph, Dijkstra's algorithm is already very efficient. Without such global information, RL can work in a model-free mode by trying different paths and learning from this experience. 

Note that the state here 

Exploration vs exploitation 

  • $\epsilon$-greedy: select $a = \arg\max_{a}Q(s,a)$ with probability $1 - \epsilon$, and select a random action with probability $\epsilon$ (a minimal sketch follows this list). 
  • Other methods for balancing exploration and exploitation: optimistic initialization, upper confidence bound, probability matching, information state search
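
A minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$; `Q[state]` is assumed to be an array of action values, and all names here are my own.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action (explore),
    otherwise pick argmax_a Q[state, a] (exploit)."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))
```

In practice $\epsilon$ is often decayed over time, so the agent explores more early on and exploits more later.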

Dynamic Programming 

DP assumes full knowledge of the transition and reward models of the MDP. 
  • Policy evaluation: Iteratively estimate the state value function of a policy using the Bellman equation. This converges to the state value function of the considered policy. 
  • Policy improvement: Given the state value of a policy, select actions greedily according to this state value function. 
  • Policy iteration (PI): Alternating between iterative policy evaluation and greedy policy improvement converges to an optimal policy and its value function. 
  • Generalized policy iteration (GPI): alternate between some steps of policy evaluation and some steps of policy improvement. 
  • Value iteration (VI): Directly iteratively update the Bellman equation for the optimal state value function. This update converges to the value function of the optimal policy. 
Bellman operator: Think of the Bellman equation as a (functional) operator (the so-called Bellman operator) $T^{\pi}$ applied to a function $v$, 

$$(T^{\pi}v)(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s',r|s,a) (r + \gamma v(s') )$$
then the true value function of the current policy is a fixed point of this operator: 
$$T^{\pi}V_{\pi} = V_{\pi}.$$
Similarly, we can define the Bellman optimality operator for the optimal value function. 

The existence (and uniqueness) of the fixed points of these operators (and thus of the value function and the optimal value function) is due to Banach's fixed point theorem [2], which basically says that every contraction (a mapping that moves points closer together, i.e., $d(f(x), f(y)) \leq \alpha d(x,y)$ for some $\alpha < 1$) on a complete metric space has a unique fixed point. 
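
To see the contraction at work, here is a minimal sketch of value iteration (repeated application of the Bellman optimality operator) on a made-up two-state MDP; successive iterates get closer in the sup-norm and converge to the fixed point $V_*$. The arrays and numbers are my own toy example.

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[s, a, s'] transition probabilities and R[s, a] expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def bellman_optimality(V):
    """(T* V)(s) = max_a [ R[s, a] + gamma * sum_s' P[s, a, s'] V[s'] ]."""
    return np.max(R + gamma * P @ V, axis=1)

V = np.zeros(2)
for i in range(60):
    V_new = bellman_optimality(V)
    if i % 10 == 0:
        # The sup-norm distance between successive iterates shrinks geometrically.
        print(i, np.max(np.abs(V_new - V)))
    V = V_new
print("V* approx:", V)
```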

Monte Carlo 

  • Learn from complete episodes (thus applicable only to episodic tasks)
  • No assumptions about the transition or reward models 
  • Maintain an MC estimate of the value function of a given policy (for policy evaluation). 
  • On-policy Monte Carlo control: a generalized policy iteration which alternates between Monte Carlo policy evaluation and $\epsilon$-greedy policy improvement. This converges to the optimal action-value function (a minimal sketch follows this list). 
  • Off-policy learning: while learning an optimal policy, the agent can follow some external source such as an exploratory (behaviour) policy, observations from humans or other agents, or experience from old policies, and it can learn about multiple policies. 
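
A minimal sketch of on-policy first-visit MC control with $\epsilon$-greedy improvement; as before, `env` is a hypothetical episodic environment whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward, done)`, and `n_actions` is assumed known.

```python
import numpy as np
from collections import defaultdict

def mc_control(env, n_actions, gamma=0.99, epsilon=0.1, n_episodes=5000):
    """On-policy first-visit MC control with epsilon-greedy improvement."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        # Generate one complete episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Policy evaluation: update Q towards the first-visit returns.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:  # first visit
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]  # running mean of returns
    return Q
```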

Temporal difference learning 

Temporal difference (TD) learning is central in RL. It directly learns the value function $V(s)$ via bootstrapping in a model-free, online and incremental manner:

$$V(s) \leftarrow V(s) + \alpha [ r + \gamma V(s') - V(s) ]$$

TD learning uses bootstrapping because it updates the value function based on current estimates: it does not wait until the value function has been estimated at all states (or until the episode ends), but uses whatever the current estimate is at the next state to update the value at the current state. This is an in-place update; the term "bootstrapping" here means building an estimate on top of other estimates rather than waiting for the true outcome.

Similarly, Q-learning and SARSA also use bootstrapping techniques.

Q-learning: $$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s,a)]$$

SARSA: $$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s', a') - Q(s,a)]$$
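
A minimal sketch of both updates in tabular form, under the same hypothetical `env` interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and a known `n_actions`; all names are my own.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=1000):
    """Off-policy: bootstrap from max_a' Q[s'][a'], whatever action is taken next."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

def sarsa(env, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=1000):
    """On-policy: bootstrap from Q[s'][a'] for the action a' actually chosen."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    def eps_greedy(s):
        return np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```

The only difference between the two functions is the bootstrap target, which mirrors exactly the difference between the two update equations above.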


References 


Saturday, May 11, 2019

Miscellaneous note #3


In research, ultimately at some point we get to a stage where we need to test the idea. Even if the idea seems good and workable, boiling it down to an implementation and making sure each piece of code works properly is very time consuming and sometimes painful. But we must do it anyway (either to prove that the idea works, or to produce a reliable counterexample showing that it does not work and should be changed). 

Tuesday, May 7, 2019

Miscellaneous note #2

A research work should be:

  • meaningful 
  • novel
  • big enough to be accumulated into a thesis. 

Sunday, April 28, 2019

Miscellaneous note #1

  • When you write down what you are learning, you will be amazed by how many concepts and ideas you would have missed otherwise. 
  • There is a limit to human tolerance; that is, if I work too much on research without proper breaks, I can burn out. Thus, this is an important factor that I should take into account when I plan any work. 
  • I do not fail; it is just exploration. It just means that I discover that something does not work out that way and that I should try something else, exploring until I find something that works. And when I find something that works, I do exploitation.