Below are common terminologies found in reinforcement learning domain. Definitions are not formal and taken mostly from Sutton’s “Reinforcement Learning: An Introduction 2nd Ed” book [1].

Equations are numbered using the same number as in the Sutton’s book to make it easier to find.

This is a work in progress.

# Introduction

Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. (from [1])

In RL, a learner (or is called an agent in RL terminology) is placed in a poorly understood, possibly stochastic and nonstationary environment. The learner interacts with the environment at discrete time steps. At every time step, the learner can observe and change the environment’s state through its actions. In addition to state changes, the environment responds to the learner’s actions with a reward, a scalar quantity that represents the immediate utility of taking a given action in a given state. The learner’s objective is to develop a policy (a mapping from states to actions) that maximizes its long-term return. (from [2])

The main elements of RL are the agent and the environment. Beyond that, the main subelements of RL are a policy, a reward signal, a value function, and, optionally, a model of the environment.

agent
The learner/decision maker being trained.
environment
The world outside in agent.
policy (π)
The agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.

An example of policy is the greedy policy, which simply chooses action with the highest estimated action value at every state.

π
Notation for policy.
action
Action, decision that the agent makes.
state
State St is the representation of environment at time t as captured by agent. This state captures all relevant information from the history.
experience
Sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
reward
Scalar reward Rt+1 that the agent receives from the environment after taking an action. Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run
reward function
Reward function is a function r(s,a) that produces the expected reward when taking action a in state s:

$r(s,a) = \mathbb{E} [ R_{t} \ | \ S_{t-1} = s, A_{t-1}=a ]$     (Eq. 3.5)

return
The return Gt is the total discounted expected reward that the agent receives from time-step t until the terminated state T:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma ^ 2 R_{t+3} + ... = \sum_{k=0}^{T} \gamma ^k R_{t+k+1}$     (Eq. 3.8)

Relating to return at successive time:

$G_t = R_{t+1} + \gamma G_{t+1}$     (Eq. 3.9)

discount rate (γ)
Discount rate, represented by γ, 0 <= γ <= 1, determines how much future return is worth now. The concept is reward that is received in future time steps is worth less than the same reward that is received now.
value functions
Functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state), in terms of future rewards that can be expected (i.e. the expected return). Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.

There are two kinds of value functions:

state value function (v(s))
This is one of value functions that is a function of states (v(s)) that estimates how good it is for the agent to be in a given state, in terms of future return that can be expected.

Advanced: see vπ (the state-value function for policy π)

action value function (q(s,a))
This is one of value functions that is is a function over state and action (q(s,a)) that estimates how good it is for the agent to perform an action in a given state, in terms of future return that can be expected.

Advanced: see qπ (the action-value function for policy π)

model
Model of the environment is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave. For example, given a state and action, the model might predict the resultant next state and next reward.

Model consists of P (state transition probability) and R (reward expectation).

Models are used for planning. Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and- error learners.

distribution model
A model that knows the probability of all possible actions from any states. The other kind of model is sample model.
sample model
A model where instead of knowing all actions and states (like distribution model), it just returns a sample action every time it is asked what to do in a state.
model-based learning
Methods for solving reinforcement learning problems that use models and planning.
model-free learning
Learning without knowing the model, by trial and error. This is done by directly performing the action according to some policy and see what happens (i.e. what state the environment moves to and what reward the agent receives from the environment).
planning
A process to use a model of the environment to produce or improve the policy. The process can be diagrammed as follows:

model → simulated experiencevalues (from backup update) → policy

Advanced: there are decision-time planning and background planning.

greedy (policy)
A policy to always select action with the highest estimated action value.

This simple policy, however, suffers from exploitation-exploitation dilemma. The ε-greedy and ε-soft policies attempt to rectify this problem by allocating small probability ε to explore less optimal actions.

exploration-exploitation dilemma
This is one classic dilemma where an agent must choose between acting greedily by exploiting the most optimal actions to maximize the immediate rewards, and exploring less optimal actions in search for possible better future rewards.
ε-greedy (policies)
A policy on which the agent usually acts greedily, but there is ε chance that it acts non-greedily to explore other actions. The ε-greedy policy is an example of ε-soft policy, and is one of the solutions to the exploration-exploitation dilemma in RL.
soft policy
Policy that selects all actions in all states with nonzero probability. One example of soft policy is ε-soft policy.
ε-soft (policies)
A kind of soft policies for which $\pi(a|s) \geq \frac {\epsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some ε > 0, i.e. given a probability ε > 0, then all actions from all states have at least ε chance of being selected. The ε-greedy policy is an example of ε-soft policy.

# Finite Markov Decision Processes

Finite MDP (Markov Decision Process)
MDP is formalization of how the agent interacts with the environment in reinforcement learning. The interaction is depicted by the following diagram:

MDP is represented by <S, A, P, R, γ> tuples, where:

• S: finite set of states
• A: finite set of actions, represents all the action that the agent can take
• P: state transition probability matrix, which maps all possible states after state S, along with their probability
• R: reward function, r(s,a), that represents the expected reward when taking action a in state s
• γ: discount factor, $\gamma \in [ 0, 1 ]$. Future rewards can be worth less than the reward that we receive now, so they are discounted.

If we have the MDP, then we know all the states in the environment, and the probability of all possible actions in each of these states, along with the reward and the probability of all possible next states after taking those actions (hint: we usually don’t).

transition probability matrix
State transition probability matrix maps all possible states after state S, along with their probability. It is defined as:

$p(s'|s,a) = \text{Pr} [ S_{t} = s' \ | \ S_{t-1} = s, A_{t-1}=a ]$     (Eq. 3.4)

episode
List of steps (the <S, A, R, S’> tuple) from start state (e.g. start of one game session) until terminal state (e.g. end of one game session) in an episodic task.
episodic
A type of interaction between agent and environment that has a clear start state (e.g. the start of a game session) and end state (e.g. end of one game session). The end state is also called terminal state. The list of steps between start and terminal states is called an episode.

Contrast this with continuing type of problems.

continuing (problem)
A type of interaction between agent and environment where the interaction is just going on and on without start state and terminal state, unlike episodic interaction.
vπ
The state value function vπ (v pi) is the value of a state s under a policy π, is the expected return when starting in s and following π thereafter:

$v_{\pi}(s) = \mathbb{E}_{\pi} [ G_t | S_t=s] = \mathbb{E}_{\pi} [ \sum_{k=0}^{T} \gamma^k R_{t+k+1} \ | \ S_t = s ]$     (Eq. 3.12)

qπ
The qπ(s, a) function is action-value function for policy π, is the expected return starting from state s, taking action a, and then following policy π thereafter.

$q_\pi (s,a) = \mathbb{E}_{\pi} [ G_t | S_t=s, A_t=a] \ \ \ \ \ \ \\ = \mathbb{E}_{\pi} [ \sum_{k=0}^{T} \gamma^k R_{t+k+1} | S_t=s, A_t=a]$     (Eq. 3.13)

optimal-policy (π)
The optimal-policy, denoted by π, is one or more policies that gives return better than or equal to all other policies
q*
See optimal action-value function
optimal action-value function (q*)
One or more action-value functions that are better than all others.

$q_*(s,a) = \text{max}_\pi q_\pi (s,a)$     (Eq. 3.16)

v*
See optimal state-value function
optimal state-value function (v*)
One or more state-value function that is better than all others.

$v_*(s) = \text{max}_\pi v_\pi (s)$     (Eq. 3.15)

Bellman expectation equations
For vπ:

$v_{\pi} = \mathbb{E} [ R_{t+1} + \gamma v_{\pi} (S_{t+1}) \ | \ S_t = s ]$     (Eq. 3.14)

For qπ:

$q_{\pi} (s, a) = \mathbb{E} [ R_{t+1} + \gamma q_{\pi} (S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]$     (Eq. 3.13)

Bellman optimality equations
(1) For v*, the value is simply the action-value if we take the optimal action:

$v_*(s) = \max _ a q_*(s,a)$     (Eq. 3.9)

(1) For q*, the value is the immediate reward plus discounted proportionally averaged value of all posible states we can land on after taking the action:

$q_*(s,a) = R_s^a + \gamma \sum_{ s' \in S}^{} P_{ss'}^a v_*(s')$

(2) Plugging the q* equation in (1), we can calculate v*:

$v_*(s) = \max _ a R_s^a + \gamma \sum_ { s' \in S} P_{ss'}^a v_*(s')$     (Eq. 3.19)

(2) And for q*:

$q_*(s,a) = R_s^a + \gamma \sum_ { s' \in S} P_{ss'}^a \max _{a'} q_*(s', a')$     (Eq. 3.20)

# Dynamic Programming

dynamic programming (DP)
A collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). With dynamic programming (DP), the agent has to visit all states and perform all possible actions on each of these states in order to calculate each and all of the state values and action values and find the optimal policy.

DP is almost never used because they need a perfect model and because they require great computational expense. They are presented here because they provide an essential foundation theories.

policy evaluation (=prediction problem)
Iteratively compute vπ, the state-value function for policy π, for all states, until their values converge according to the policy.
policy improvement
Establishing new policy that selects new action which yields higher return than previous policy.
policy iteration
Doing the following tasks in a loop:

1. Perform policy evaluation to calculate value functions for all states based on current policy.
2. Perform policy improvement based on the new values to refine the policy.

When policy evaluation is stopped after just one sweep (one update of each state), this algorithm is called value iteration.

value iteration
A type of policy iteration where policy improvement is performed after just one sweep of policy evaluation (one update of each state).
asynchronous dynamic programming
A kind of dynamic programming where instead of sweeping through all possible transitions, the algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be updated several times before the values of others are updated once. To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation.

Example of an asynchronous DP algorithm is real-time dynamic programming.

generalized policy iteration (GPI)
A general idea of having an approximate value function and an approximate policy, and continually try to improve each on the basis of the other by means of iterations.
online (method of calculation)
The calculation of value functions can be performed at each timestep, i.e. doesn’t have to wait until the end of an episode. As opposed to offline.
offline (method of calculation)
The calculation of value functions can only be performed once an episode completes. As opposed to online.
update
Operation to shift the value function at particular state (or state-action) toward a “backed-up value” or update target for that state (or state-action). The general form of update rule is:

NewEstimate ← OldEstimate + StepSize [ Target − OldEstimate ]     (Eq. 2.4)

Based on the scope of nodes that are involved in the update, there are two kinds: sample update and expected update.

Advanced: a notation s → u is also used, where s is the state updated and u is the update target that s’s estimated value is shifted toward. The following are updates for common RL methods:

• Monte Carlo update: St → Gt
• TD(0) update: St → Rt+1 + γvˆ(St+1,wt)
• n-step TD update: St → Gt:t+n
• DP update: s → Eπ[Rt+1 + γvˆ(St+1,wt) | St = s]
expected update
In a one-step update, all possible next events (states/actions) are considered when calculating current state/action value. The values of next events are weighted according to their probability of occuring. Expected update thus requires a distribution model. Contrast this with sample update.
sample update
In a one-step update, only one next state/action value (taken from experience) is considered for calculating current state/action value. Contrast this with expected update.
bootstrapping
Estimate the values of states based on estimates of the values of successor states.
backward view
The agent only needs to look at current and prior states/values in order to perform update. Opposite of forward view.
forward view
The agent needs to look at few states after current state St in order to perform calculation on St. In practical term, the agent needs to wait until St+n before it can calculate St. See also backward view above.
credit assignment problem
If we get a final reward R after doing actions a1, a2, a3 and visiting states s1, s2, s3, which actions/states should we attribute the reward to. (Note: the eligibility trace attempts to solve this)

# Monte Carlo

Monte Carlo (MC)
Monte-Carlo is a model free, sample based methods to solve the RL problem based on averaging sample returns. The experience is divided into episodes, where each episode contains time steps/actions. When an episode completes (therefore MC is an offline learning method), value estimates and policies are updated by averaging the sampled returns and the previous average return:

$V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t))$     (Eq. 6.1)

where V(St) is the average return of state St. The equation above is for every-visit Monte Carlo method.

The value function for policy π is the expected return on the state:

$v_{\pi}(s) = \mathbb{E}_{\pi} [ G_t | S_t = s]$     (Eq. 6.3)

sample based
Taking actual action in an environment and see what happens, instead of sweeping/knowing all possible actions and rewards.
on-policy (learning)
“Learn on the job”. The agent is following the same policy as the policy it is evaluating. Learn about policy π from experience sampled from π.

off-policy (learning)
The agent is following a policy (called behavior policy) that is different than the policy it is evaluating (called target policy). It learns about policy π from experience sampled from μ. It is learning by “looking over someone’s shoulder”.

Off-policy methods are more powerful and general, but are often of greater variance and are slower to converge. See also on-policy learning.

behavior policy
The policy that is used by an agent to choose what action to do in a particular state. Contrast this with target policy. If the behavior and target policies are the same then it is called on-policy learning, and if they are different it is called off-policy learning.
target policy
The policy that is being evaluated, i.e. its value functions are being calculated, and improved during learning. Contrast this with behavior policy. If the behavior and target policies are the same then it is called on-policy learning, and if they are different it is called off-policy learning.

# Temporal-Difference Learning

temporal difference (TD) learning
Like Monte-Carlo, it is a model free, sample based methods to solve the RL problem based on averaging sample returns, but using bootstrapping to estimate the value. The experience is divided into episodes, where each episode contains time steps/actions. Unlike MC, TD does not have to wait until the episode completes before making an update (therefore TD is an online algorithm).

On every time step, the algorithm updates the previous state/state action with the reward just received and whatever estimation of state/action value function of current state. If the state updated is the exact previous state, the algorithm is called TD(0), while if the state updated is nth steps before, the algorithm is called n-step TD (advanced).

TD(0) [one-step TD]
A kind of TD learning that updates the exact previous state at every time step with the following update:

$V(S_t) \leftarrow V(S_t) + \alpha [ R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$     (Eq. 6.2)

As a side note, the part inside the square bracket above is called TD-error.

$v_{\pi}(s) = \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s]$     (Eq. 6.4)

TD-error
TD-error, represented by δt, is the error in the estimation at a given timestep, i.e. the difference between current estimation and the actual reward and discounted estimation value of next state. Because of this, TD-error is only available one step later.

$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$     (Eq. 6.5)

Sarsa
Sarsa is an on-policy TD control. It’s named after sequence of S, A, R, S’, A’.

The update rule for one-step Sarsa is:

$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$     (Eq. 6.7)

This update is done after every transition from a nonterminal state St. If St+1 is terminal, then Q(St+1 , At+1 ) is defined as zero.

Algorithm for Sarsa:


Initialize Q(s,a) arbitrarily, Ɐs ϵS, aϵA(s), and Q(terminal state,·)=0
Repeat (for each episode):
Initialize S
Choose A from S using policy derived from Q (e.g. ϵ-greedy)
Repeat (for each step in episode):
Take action A, observe R, S'
Choose A' from S' using policy derived from Q (e.g. ϵ-greedy))
Q(S,A) <-- Q(S,A) + α[ R + γ Q(S',A') - Q(S,A) ]
S <-- S'; A <-- A'
until S is terminal

Q-learning
Q-learning is an off-policy TD control algorithm, defined by:

$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \text{max}_a Q(S_{t+1},a) - Q(S_t,A_t)]$     (Eq. 6.8)

In this case, the learned action-value function, Q, directly approximates q*, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs.

Q-Learning algorithm:


Initialize Q(s,a) arbitrarily, Ɐs ϵS, aϵA(s), and Q(terminal state,·)=0
Repeat (for each episode):
Initialize S
Repeat (for each step in episode):
Choose A from S using policy derived from Q (e.g. ϵ-greedy)
Take action A, observe R, S'
Q(S,A) <-- Q(S,A) + α[ R + γ maxa Q(S',a) - Q(S,A) ]
S <-- S'
until S is terminal

expected Sarsa
Just like Sarsa, except that instead of the maximum over next state–action pairs it uses the expected value, taking into account how likely each action is under the current policy:

$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \ [R_{t+1} + \gamma \mathbb{E}[ Q(S_{t+1},A_{t+1}) | S_{t+1}] - Q(S_t,A_t)]$

$\leftarrow Q(S_t,A_t) + \alpha \ [R_{t+1} + \gamma \sum_a \pi (a | S_{t+1}) Q(S_{t+1},a) - Q(S_t,A_t)]$     (Eq. 6.9)

Expected Sarsa is more complex computationally than Sarsa, but generally performs better on the same amount of experience.

maximization bias
Tendency (bias) to select the maximum of the estimated value. For example, given two possible actions, action right which always gives zero, and action left with reward drawn from normal distribution with mean -0.1 and variance 1.0, the agent will choose left because of maximization bias.

The double learning method below can be used to eliminate this.

double learning
A learning method to remove maximization bias by dividing the experience in two sets and use them to learn two independent estimates, call them Q1(a) and Q2(a), and use them separately for estimation and control.

# n-step Bootstrapping

n-step TD
A kind of TD learning that updates the previous nth state at every time step:

$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [ G_{t:t+n} - V_{t+n-1}(S_t)], \ \ \ 0 \leq t < T$     (Eq. 7.2)

where Gt:t+n is the n-step return as explained below.

When n is 0, the algorithm is called TD(0) which has been explained above.

n-step return
The sum of the first n rewards plus the estimated value of the state reached in n steps, each appropriately discounted.

$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n} V_{t+n-1} (S_{t+n})$     (Eq. 7.1)

n-step Sarsa
n-step method for Sarsa:

$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [ G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], 0 \leq t < T$     (Eq. 7.5)

where:

$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n} Q_{t+n-1} (S_{t+n}, A_{t+n})$     (Eq. 7.4)

n-step off-policy TD with importance sampling
Applying off-policy learning to n-step TD (including n=1). This is accomplished by scaling the update with importance sampling or by using tree-backup algorithm (performing expected update to nodes in the episode).

$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} [ G_{t:t+n} - V_{t+n-1}(S_t)], 0 \leq t < T$     (Eq. 7.7)

where $\rho_{t:h}$ is called importance sampling and is explained below.

importance sampling
A general technique for estimating expected values under one distribution given samples from another, as in off-policy learning methods. When estimating vπ(s) (i.e. state value of s under policy π) from sample of returns taken under behavior policy b, we scale the return with a importance-sampling ratio ρt:T −1:

$\rho_{t:h}=\prod_{k=t}^{min(h,T-1)} \frac {\pi(A_k\ |\ S_k)} {b(A_k\ |\ S_k)}$     (Eq. 7.8)

There are two ways to do importance sampling: ordinary importance sampling and weighted importance sampling. In practice, the weighted importance sampling usually has dramatically lower variance and is strongly preferred.

Per-decision off-policy methods with control variates
TBD.
n-step tree backup algorithm
TBD.

# Planning and Learning with Tabular Methods

direct reinforcement learning
Using the experience to find value functions/policy. This is what most RL algorithms do (e.g. Monte Carlo or TD). Contrast this with model learning where the experience is used to learn a model. In Dyna, direct-RL and model-learning is used simultaneously.
model learning
Sometimes called indirect reinforcement learning, this is a method to use the experience to learn a model. Contrast this with direct reinforcement learning where the experience is used to learn the value functions/policy. In Dyna, direct-RL and model-learning is used simultaneously.
indirect reinforcement learning
See model learning.
Dyna
A blend of model-based and model-free learning learning, where the actual experience is used not only to find value or policy (the so called direct RL) but also to learn the model (i.e. model learning), and then using that model to generate more experience to train the agent.
prioritized sweeping
Perform updates more effectively in Dyna by generating transitions by focusing on transitions that were most affected (i.e. which values are likely to change the most) by the recent update.
trajectory sampling
Distribute updates by sampling from the state (or state–action) space according to some distribution rather than sweeping through the entire state (or state action) as in dynamic programming.
real-time dynamic programming (RTDP)
RTDP is an on-policy trajectory-sampling version of DP’s value-iteration algorithm. RTDP is an example of an asynchronous DP algorithm.
decision-time planning
Using simulated experience or to consult a model to select an action for the current state. As opposed to background planning where the simulated experience or model is used to improve value functions for all/many states not just current one.
background planning
Using simulated experience and feed it to the learning algorithm to gradually improve a policy or value function for all/many states, just just current one.As opposed to decision-time planning.
Decision-time planning methods to pick the best action, for each state encountered, considering all possible continuations.
rollout algorithms
Decision-time planning algorithms based on Monte Carlo control that generate simulated episodes that start with each possible actions in current state and estimate the action value by averaging them. An action then is chosen, new state is entered, and the process starts again. An example of rollout algorithm is Monte Carlo Tree Search (MCTS)
A kind of rollout algorithm.

# On-policy Prediction with Approximation

function approximation
Since the number of states (or state-action) can be very large, it’s often not feasible to store the value-function directly into each state or state-action (the so called tabular method). Therefore a model to approximate the state value function (or action value function) is needed.
There are two kinds of function approximation:

One popular method to train function approximation is using stochastic gradient descent (SGD).

state value function approximation
Value approximation for state value. Several kinds of such model can be used, such as:

Training the model is a similar process as in supervised learning, and in this case the prediction objective (VE) function is often used as the objective function for the training. One most widely used training method is stochastic gradient descent (SGD).

action-value approximation
Approximating the action value with parametric approximation function q^(s, a, w) ≈ q*(s, a), where w ∈ Rd is a finite-dimensional weight vector.
One popular method to find action-value approximation is using stochastic gradient descent (SGD) for action value.
prediction objective (VE)
The prediction objective (VE) is the objective function for learning the state value approximation. It is a Mean Squared Value Error that represents the difference between predicted value by state value approximation and the actual value. The formula is:

$\overline{VE}(w) = \sum_{s \epsilon S} \mu(s)\ [v_\pi(s) - \hat{v}(s,w)]^2$     (Eq. 9.1)

where w is the weight vector for the value-function approximation.
The square root of this measure, the root VE, gives a rough measure of how much the approximate values differ from the true values and is often used in plots.
The µ(s) is the state distribution.

state distribution µ(s)
State distribution µ(s) is used in prediction objective (VE) calculation to represent how much we care about the error in each state s. For each state, µ(s) ≥ 0, and the sum of µ(s) for all states: ∑s µ(s) = 1. Often µ(s) is chosen to be the fraction of time spent in s.
Under on-policy training this is called the on-policy distribution. In continuing tasks, the on-policy distribution is the stationary distribution under π. In episodic tasks, the on-policy distribution is calculated as follows.
Let h(s) denote the probability that an episode begins in each state s, and let η(s) denote the number of time steps spent, on average, in state s in a single episode. Time is spent in a state s if episodes start in s, or if transitions are made into s from a preceding state in which time is spent:

$\eta(s) = h(s) + \sum_{\overline{s}} \eta(\overline{s}) \sum_a \pi(a|\overline{s}) \ p(s|\overline{s},a),\ \text{for all }s\epsilon S$     (Eq. 9.2)

This system of equations can be solved for the expected number of visits η(s). The on-policy distribution is then the fraction of time spent in each state normalized to sum to one:

$\mu(s) = \frac {\eta(s)} {\sum_{s'} \eta(s')} \text{, for all } s \epsilon S$     (Eq. 9.3)

Note that discounting can also be used.

SGD methods are among the most widely used methods to find approximation function and are particularly well suited to online reinforcement learning.
In gradient-descent methods, the weight vector is a column vector with a fixed number of real valued components, w = (w1, w2, . . . , wd) (or w ∈ Rd), and the state value approximation v^(s,w) (or action value approximation q^(s,a,w)) is a differentiable function of w for all s ∈ S.
We will be updating w at each of a series of discrete time steps, t = 0, 1, 2, 3, …, so we will need a notation wt for the weight vector at each step.
There are two kinds of SGD methods:

SGD for action value approximation
Stochastic gradient descent for finding action value approximation.
On each step, we observe a new example St, At → Ut, where Ut can be any approximation of qπ(St, At) such as Monte Carlo return (Gt) or any of the n-step Sarsa returns (see eq 7.4).
The general SGD update for action value prediction is:

$w_{t+1} = w_t + \alpha [U_t - \hat{q}(S_t,A_t,w_t)] \triangledown \hat{q}(S_t,A_t,w_t)$     (Eq. 10.1)

For example, SGD update for one-step Sarsa method is:

$w_{t+1} = w_t + \alpha [R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},w_t) - \hat{q}(S_t,A_t,w_t)] \triangledown \hat{q}(S_t,A_t,w_t)$     (Eq. 10.2)

The equation above is called episodic semi-gradient one-step Sarsa.
The SGD update for n-step semi-gradient Sarsa:

$w_{t+n} = w_{t+n-1} + \alpha [G_{t:t+n} - \hat{q}(S_t,A_t,w_{t+n-1})] \triangledown \hat{q}(S_t,A_t,w_{t+n-1}), \ 0 \leq t < T$     (Eq. 10.5)

SGD for state value approximation
Stochastic gradient descent for finding state value approximation.
On each step, we observe a new example St → vπ(St) (see update) consisting of a (possibly randomly selected) state St and its true value under the policy.
Assume that states appear in samples with the same distribution, µ, over which we are trying to minimize the prediction objective VE.
The SGD update is then defined as follows:

$w_{t+1} = w_t - \frac{1}{2} \alpha \triangledown [v_\pi(S_t) - \hat{v}(S_t,w_t)]^2$     (Eq. 9.4)

$= w_t + \alpha [v_\pi(S_t) - \hat{v}(S_t,w_t)] \triangledown \hat{v}(S_t,w_t)$     (Eq. 9.5)

Since we don’t know the true value vπ(St), we replace it with Ut, the approximation to it, which might be a noise-corrupted version of vπ(St), or it might be one of the bootstrapping targets using v^.
This yields the following general SGD method for state-value prediction:

$w_{t+1} = w_t + \alpha [U_t - \hat{v}(S_t,w_t)]\ \triangledown \hat{v}(S_t,w_t)$     (Eq. 9.7)

If Ut is an unbiased estimate (e.g. when Ut = Gt from Monte Carlo target), then SGD is guaranteed to converge to a local optimum for decreasing α.
The same guarantee cannot be made if bootstrapping estimate is used as the target Ut. Bootstrapping methods are not in fact instances of true gradient descent. They are called semi-gradient methods.

linear value-function approximation
One of the most important special cases of function approximation in which the approximate function, v^(·,w), is a linear function of the weight vector, w. Corresponding to every state s, there is a real-valued vector x(s) = (x1(s), x2(s), . . . , xd(s)), with the same number of components as w.
The linear function approximation is defined as:

$\hat{v}(s,w) = w^T.x(s) = \sum_{i=1}^{d} w_i.x_i(s)$     (Eq. 9.8)

Since

$\triangledown \hat {v}(s,w) = x(s)$,

the SGD for linear function approximation is then:

$w_{t+1}= w_t + \alpha [U_t - \hat{v}(S_t,w_t)] x(S_t)$,

TD fixed point (WTD)
TD fixed point, WTD, is a weight vector where semi-gradient state value approximation using TD(0) converges to.
TD fixed point is defined as:

$W_{TD} = A^{-1}b$     (Eq. 9.12)

where:

$b \doteq \mathbb{E}[R_{t+1} \ x_t] \ \epsilon \ \mathbb{R} \text{ and } A \doteq \mathbb{E}[x_t(x_t - \gamma x_{t+1})^T] \ \epsilon \ \mathbb{R}^d \times \mathbb{R}^d$     (Eq. 9.11)

TD fixed point can be found by using iteration or directly using Least-Squares TD (LSTD) method.

Least-Squares TD (LSTD)
Method to directly compute TD fixed point by computing the estimates for A and b as follows:

$\hat{A_t} = \sum_{k=0}^{t-1} x_k (x_k - \gamma x_{k+1})^\top + \epsilon I \ \text{ and } \ \hat{b}_t = \sum_{k=0}^{t-1} R_{t+1} x_k$     (Eq. 9.20)

where I is the identity matrix, and εI, for some small ε > 0, ensures that At is always invertible.
LSTD estimates the TD fixed point as:

$w_t = \hat{A_t}^{-1} \hat{b_t}$     (Eq. 9.21)

This algorithm is the most data efficient form of linear TD(0), but it is also more expensive computationally.

In SGD for finding the weight for state value approximation, bootstrapping methods such as TD are not instances of true gradient descent because their target value Ut depends on the weight vector w which values are being learned by the SGD.
They take into account the effect of changing the weight vector wt on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, they are called semi-gradient methods.
Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case.
Semi-gradient method using TD(0) target as the target update ($U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t)$).
Plugging into equation 9.7, the formula to update the weight hence becomes:

$w_{t+1} = w_t + \alpha [R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t,w_t)] \triangledown \hat{v}(S_t,w_t)$

# On-policy Control with Approximation

SGD for action-value prediction
The general gradient-descent update for action-value prediction is:

$w_{t+1} = w_t + \alpha [U_t - \hat{q}(S_t,A_t,w_t)] \triangledown \hat{q}(S_t,A_t,w_t)$     (Eq. 10.1)

where Ut can be any approximation of qπ(St, At), including the usual backed-up values such as the full Monte Carlo return (Gt) or any of the n-step Sarsa returns (7.4)

For example, next is the update for the one-step Sarsa method.

SGD for action-value update for one-step Sarsa method is:

$w_{t+1} = w_t + \alpha [R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w_t) - \hat{q}(S_t,A_t,w_t)] \triangledown \hat{q}(S_t,A_t,w_t)$     (Eq. 10.2)

The equation above is called episodic semi-gradient one-step Sarsa.

The n-step update equation is:

$w_{t+n} = w_{t+n-1} + \alpha [G_{t:t+n} - \hat{q}(S_t,A_t,w_{t+n-1})] \triangledown \hat{q}(S_t,A_t,w_{t+n-1})$     (Eq. 10.5)

with the return is defined as:

$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n} \hat{q}{t+n-1} (S_{t+n}, A_{t+n}, w_{t+n-1}), \ \ t+n < T$     (Eq. 10.4)

with Gt:t+n = Gt if t + n ≥ T, as usual.

ergodicity
An MDP is said to have ergodicity property, if you select actions according to π, it remains in the same distribution. It means that where the MDP starts or any early decision made by the agent can have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities.
average reward r(π)
This is the third classical setting—alongside the episodic and discounted settings—for formulating the goal in Markov decision problems (MDPs). Like the discounted setting, the average reward setting applies to continuing problems, problems for which the interaction between agent and environment goes on and on forever without termination or start states. Unlike that setting, however, there is no discounting—the agent cares just as much about delayed rewards as it does about immediate reward.

The average reward r(π) is the average rate of reward while following that policy, and is defined as:

$r(\pi) = \lim_{h \rightarrow \infty } \frac{1}{h} \sum_{t=1}^{h} \mathbb{E} [ R_t \mid A_{0:t-1} \sim \pi ],$

$r(\pi) = \lim_{h \rightarrow \infty } \mathbb{E} [ R_t \mid A_{0:t-1} \sim \pi ],$

$r(\pi) = \sum_s \mu(s) \sum_a \pi(a \mid s) \sum _{s',r} p(s',r \mid s,a) \ r$     (Eq. 10.6)

differential return
In the average-reward setting, returns are defined in terms of dierences between rewards and the average reward:

$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + ...$     (Eq. 10.8)

The corresponding value functions are called differential value functions and have the same notations (e.g. vπ(s) = Eπ[Gt | St=s], qπ(s) = .., and similarly for v* and q*).

For the Bellman equations, simply remove all γ and replace all rewards by the difference between the reward and the true average reward.

This is the average-reward version of semi-gradient Sarsa:

$w_{t+1} = w_t + \alpha \delta _t \triangledown \hat {q} (S_t, A_t, w_t)$     (Eq. 10.11)

The equation above is similar to Eq. 10.2, but the TD-error is updated to use the differential version:

$\delta_t = R_{t+1} - \overline{R}_{t+1} + \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)$     (Eq. 10.9)

and

$\delta_t = R_{t+1} - \overline{R}_{t+1} + \hat{q}(S_{t+1}, A_{t+1}, w_t) - \hat{v}(S_t, A_{t+1}, w_t)$     (Eq. 10.10)

where $\overline{R}$ is an estimate at time t of the average reward r(π).

TBD.

# Off-policy Methods with Approximation

TBD.
emphatic TD methods
TBD.

# Eligibility Traces

n-step return with approximation
The general form of n-step return that uses approximation:

$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n} \hat{v} (S_{t+n}, w_{t+n-1})$     (Eq. 12.1)

λ-return (lambda-return)
λ-return is calculated as the average of all n-step returns:

$G_t^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda ^ {n-1} G_{t:t+n}$     (Eq. 12.2)

or if the last element is treated separately:

$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda ^ {n-1} G_{t:t+n} \ \ + \lambda ^ {T-t-1} G_t$     (Eq. 12.3)

offline λ-return algorithm
A learning algorithm based on λ-return:

$w_{t+1} \doteq w_t + \alpha \left[ G_{t}^{\lambda} - \hat{v}(S_t,w_t) \right] \triangledown \hat{v}(S_t,w_t), \ \ \ t=0,...,T-1$     (Eq. 12.4)

eligibility trace
The eligibility trace is a vector zt ϵ Rd with the same number of components as the weight vector wt, and is calculated as follows:

$z_{t} \doteq \gamma \lambda z_{t-1} + \triangledown \hat{v}(S_t,w_t), \ \ 0 \leq t \leq T$     (Eq. 12.5)

The eligibility trace is used to keep track of which components of the weight vector have contributed, positively or negatively, to recent update. It is set to zero at the beginning of an episode (z-1 = 0), then updated by the gradient on each time step, then fades away by γλ.

This eligibility trace is also called accumulating trace, to distinquish it from other kind of eligibility trace such as the dutch trace defined by eq. 12.11.

TD(λ)
An offline TD(λ) algorithm is defined as follows:

$w_{t+1} \doteq w_t + \alpha \delta_t z_t$     (Eq. 12.7)

In the equation above, the weight vector is updated at each step proportional to the eligibility trace vector and the scalar TD-error below:

$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)$     (Eq. 12.6)

truncated λ-return
The λ-return above is limited because one needs to wait until the end of episode. On the other hand, the reward gets weaker as the episode gets longer due to γλ. Therefore it seems like a good idea to truncate the reward of this long episode down to certain length.

The truncated λ-return for time t, given data only up to some later horizon, h, is defined as as:

$G_{t:h}^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda ^ {n-1} G_{t:t+n} \ \ + \lambda ^ {h-t-1} G_{t:h}$     (Eq. 12.9)

TTD(λ) – truncated TD(λ)
The truncated TD(λ) or TTD(λ) is defined as:

$w_{t+n} \doteq w_{t+n-1} + \alpha \left[ G_{t:t+n}^{\lambda} - \hat{v}(S_t, w_{t+n-1}) \triangledown \hat{v}(S_t, w_{t+n-1}) \right]$

online λ-return
The online λ-return is currently the best performing temporal-difference algorithm. The idea is that, on each time step as a new increment of data is gathered, you go back and redo all the updates since the beginning of the current episode. The new updates will be better than the ones you previously made because now they can take into account the time step’s new data.

TBD.

true online TD(λ)
The true online TD(λ) for linear case in which $\hat{v} (s,w) = w^{\top} x(s)$:

$w_{t+1} \doteq w_{t} + \alpha \delta_t z_t + \alpha ( w_t^\top x_t - w_{t-1}^\top x_t ) (z_t - x_t)$

where δt is defined as in TD(λ) in eq. 12.6 and zt is defined by:

$z_t \doteq \gamma \lambda z_{t-1} + ( 1 - \alpha \gamma \lambda z_{t-1}^{\top} x_t)\ x_t$     (Eq. 12.11)

This algorithm has been proven to produce exactly the same sequence of weight vectors as the on-line λ-return algorithm above.

The eligibility trace used in eq 12.11 above is called dutch trace, to distinquish from the trace (eq 12.5) used in TD(λ), which is called accumulating trace.

offline λ-return for action value
The action-value form of the off-line λ-return algorithm (12.4) simply uses q^ rather than v^:

$w_{t+1} \doteq w_t + \alpha \left[ G_{t}^{\lambda} - \hat{q}(S_t,A_t,w_t) \right] \triangledown \hat{q}(S_t,A_t,w_t), \ \ \ t=0,...,T-1$     (Eq. 12.15)

where $G_t^{\lambda} \doteq G_{t:\infty}^{\lambda}$

Sarsa(λ)
Sarsa(λ) has the same update rule as given earlier for TD(λ):

$w_{t+1} \doteq w_t + \alpha \delta_t z_t$     (Eq. 12.7)

except, naturally, using the action-value form of the TD error:

$\delta_t \doteq R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w_t) - \hat{q}(S_t, A_t, w_t)$

and the action-value form of the eligibility trace:

$z_{-1} \doteq 0, \\ z_{t} \doteq \gamma \lambda z_{t-1} + \triangledown \hat{q}(S_t,A_t,w_t), \ \ 0 \leq t \leq T$

true online Sarsa(λ)
There is also an action-value version of our ideal TD method, the online λ-return. Just replace the state value function with action-value form of the n-step return. In the case of linear function approximation, the ideal algorithm again has an exact, efficient O(d) implementation, called True Online Sarsa(λ).

Pseudo-code is given in the book.

Watkins’s Q(λ)
Watkins’s Q(λ) extends Q-learning to eligibility traces.

TBD.

Tree-Backup(λ)
Eligibility trace version of Tree Backup.

TBD.

GTD(λ)
GTD(λ) is the eligibility-trace algorithm analogous to TDC, the better of the two state-value Gradient-TD prediction algorithms.

TBD.

GQ(λ)
GQ(λ) is the Gradient-TD algorithm for action values with eligibility traces.

TBD.

HTD(λ)
HTD(λ) is a hybrid state-value algorithm combining aspects of GTD(λ) and TD(λ).

TBD.

emphatic TD(λ)
Emphatic TD(λ) is is the extension of the one-step Emphatic-TD algorithm to eligibility traces.

TBD.

All the previous methods have learned the values of actions and then selected actions based on their estimated action values. In this chapter, we instead learn a parameterized policy that can select actions without consulting a value function.

We use the notation θ ϵ Rd’ for the policy’s parameter vector. If a method uses a learned value function as well, then the value function’s weight vector is denoted w ϵ Rd as usual, as in vˆ(s,w).

Given J(θ) defined as a performance measure with respect to the policy parameter, the policy gradient methods seek to maximize performance, so their updates approximate gradient ascent in J:

$\theta_{t+1} = \theta_t + \alpha \ \widehat{ \triangledown J(\theta _t) }$     (Eq. 13.1)

Reinforce update:

$\theta_{t+1} \doteq \theta_t + \alpha \ G_t \frac {\triangledown_{\theta} \pi (A_t | S_t, \theta_t) } { \pi (A_t | S_t, \theta_t) }$     (Eq. 13.6)

REINFORCE with baseline
Generalization of REINFORCE algorithm to include a comparison of the action value to an arbitrary baseline b(s):

$\theta_{t+1} \doteq \theta_t + \alpha\ ( G_t - b(S_t)\ ) \frac {\triangledown_{\theta} \pi (A_t | S_t, \theta_t) } { \pi (A_t | S_t, \theta_t) }$     (Eq. 13.9)

The baseline b(s) could be anything, including zero. One natural choice for the baseline is an estimate of the state value, vˆ(St,w), where w ϵ Rm is a weight vector learned by one of the methods presented in previous chapters. It also seems natural to use a Monte Carlo method to learn the state-value weights, w.

actor-critic
For a one-step actor-critic method, the update is fully online and incremental, yet avoid the complexities of eligibility traces. One-step actor–critic methods replace the full return of REINFORCE (13.9) with the one-step return (and use a learned state-value function as the baseline) as follows:

$\theta_{t+1} = \theta_t + \alpha\ \delta _t \frac {\triangledown_{\theta} \pi (A_t | S_t, \theta_t) } { \pi (A_t | S_t, \theta_t) }$     (Eq. 13.12)

where δt is:

$\delta_t = G_{t:t+1} - \hat{v} (S_t,w)$     (Eq. 13.10)

or

$\delta_t = R_{t+1} + \gamma \hat{v} (S_{t+1},w) - \hat{v} (S_t,w)$     (Eq. 13.11)

The natural state-value-function learning method to pair with this is semi-gradient TD(0). The one-step actor critic above can be extended to n-step methods and then to a λ-return algorithm.