Thursday, May 16, 2019

RL summary - part 2


Dynamic Programming vs Temporal Diference vs Monte Carlo

  • DP is full backup:
    • Policy evaluation: $$v(s) \leftarrow \mathbb{E}[r + \gamma v(s') | s] $$
    • Q policy evaluation: $$Q(s,a) \leftarrow \mathbb{E}[r + \gamma Q(s', a') | s, a]$$ 
    • Q value iteration: $$Q(s,a) \leftarrow \mathbb{E}[r + \gamma \max_{a'} Q(s', a') | s,a]$$
  • TD is sample backup:
    • TD learning: $$v(s) \leftarrow v(s) + \alpha [r + \gamma v(s') - v(s)] $$
    • SARSA: $$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma Q(s', a') - Q(s,a)] $$
    • Q-learning: $$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma max_{a'} Q(s', a') - Q(s,a)]$$
  • MC is also sample backup: 
$$v(s) = \sum_{r \text{ generated at state } s \text{ within the episode}} r$$

  • DP: require model (the state transition distribution and the reward distribution), bootstrap, and full backups (one step). 
  • TD: model-free, bootstrap, and sample backup (one step or multi-steps). biased but low variance, sensitive to initialization, effective in Markov environments because it exploits the Markov structure for update, can learn online, can learn from incomplete sequence and can learn in non-episodic environments. 
  • MC: model-free, not bootstrap, sample backup, zero bias but high variance, insensitive to initialization, good convergence property, more effective in non-Markov environments, only learn from complete episodes in episodic environments. 

No comments:

Post a Comment