Differences from other machine learning paradigms:

• There is no supervisor, only a reward signal
• Feedback is delayed, not instantaneous
• Time really matters: the data is sequential, not i.i.d.
• Agent’s actions affect the subsequent data it receives.

Basic elements

Rewards

• A reward $$R_t$$ is a scalar feedback signal
• Indicates how well the agent is doing at step $$t$$

Reward hypothesis: all goals can be described by the maximization of expected cumulative reward, so the agent selects actions to maximize total future reward.
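As a small illustration of "total future reward", the sketch below sums the rewards from a given step onward. The function name `total_future_reward` and the sample rewards are hypothetical, and no discounting is applied because these notes do not introduce it.

```python
def total_future_reward(rewards, t):
    """Sum of the rewards received from step t (0-indexed) to the end of the episode."""
    return sum(rewards[t:])


# Hypothetical per-step rewards R_1, ..., R_5 for one episode.
rewards = [0.0, 0.0, 1.0, -0.5, 2.0]

print(total_future_reward(rewards, 0))  # 2.5
print(total_future_reward(rewards, 3))  # 1.5
```

Under the reward hypothesis, an agent would prefer the action sequence whose expected value of this sum is largest.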

History

The history is the sequence of observations, actions, and rewards up to time $$t$$:

$H_t = A_1, O_1, R_1, \dots, A_t, O_t, R_t$

State

State is the information used to determine what happens next, so that we don’t need to go back through the whole history every time. Formally, state is a function of the history:

$S_t = f(H_t)$
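To make $$S_t = f(H_t)$$ concrete, here is a minimal sketch with two possible choices of $$f$$. The `Step` record, the integer observations, and both state functions are illustrative assumptions; the notes do not prescribe any particular $$f$$.

```python
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    """One entry of the history: (A_t, O_t, R_t)."""
    action: int
    observation: int
    reward: float


History = List[Step]


def last_observation_state(history: History) -> int:
    """One possible f: the state is just the most recent observation."""
    return history[-1].observation


def last_k_observations_state(history: History, k: int = 3) -> Tuple[int, ...]:
    """Another possible f: the state is the last k observations."""
    return tuple(step.observation for step in history[-k:])


# Hypothetical history H_4.
history = [Step(0, 10, 0.0), Step(1, 11, 0.0), Step(1, 12, 1.0), Step(0, 13, 0.0)]

print(last_observation_state(history))
print(last_k_observations_state(history))
```

Both functions compress the same history into different states, which is why the choice of agent state matters for what the agent can predict.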

Environment state $$S_t^e$$

The environment state is the environment’s private representation. It is not usually visible to the agent.

Agent state $$S_t^a$$

The agent state is the agent’s internal representation, which it uses to pick the next action.

Markov state

If the state is defined so that it has the Markov property, the future is independent of the past given the present.
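
Written as an equation, using the standard definition and the state notation $$S_t$$ from above, a state is Markov if and only if:

$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$

In other words, once $$S_t$$ is known, the earlier states add no further information about $$S_{t+1}$$.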

Full observability

The agent directly observes the environment state. Formally, this is a Markov decision process (MDP).

Partial observability

Agent state $$\neq$$ environment state. Formally, this is a partially observable Markov decision process (POMDP).

Inside an RL agent

• Policy: the agent’s behaviour function, either deterministic $$a = \pi(s)$$ or stochastic $$\pi(a \mid s)$$
• Value function: how good is each state and/or action
• Model: agent’s representation of the environment

$P^a_{ss'} = \mathbb{P}[S'=s' \mid S=s, A=a]$ and $R_s^a = \mathbb{E}[R \mid S=s, A=a]$
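One simple way to obtain such a model, shown here only as a sketch, is to count observed transitions in a small tabular problem. The function name `estimate_model`, the `(s, a, r, s')` tuple layout, and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_model(transitions):
    """Estimate P[s' | s, a] and E[R | s, a] by counting (s, a, r, s') samples."""
    next_counts = defaultdict(Counter)  # (s, a) -> Counter of next states
    reward_sums = defaultdict(float)    # (s, a) -> sum of observed rewards
    visits = Counter()                  # (s, a) -> number of samples

    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1

    # Normalize counts into empirical probabilities and mean rewards.
    P = {sa: {s2: c / visits[sa] for s2, c in counts.items()}
         for sa, counts in next_counts.items()}
    R = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return P, R


# Hypothetical experience: four samples of taking "right" in state "s0".
transitions = [
    ("s0", "right", 0.0, "s1"),
    ("s0", "right", 1.0, "s2"),
    ("s0", "right", 0.0, "s1"),
    ("s0", "right", 1.0, "s1"),
]
P, R = estimate_model(transitions)
print(P[("s0", "right")])  # {'s1': 0.75, 's2': 0.25}
print(R[("s0", "right")])  # 0.5
```

Here `P` plays the role of $$P^a_{ss'}$$ and `R` the role of $$R_s^a$$, both as empirical estimates rather than the true environment dynamics.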

A policy-based agent may have no value function; likewise, a value-based agent may have no explicit policy, since its policy is implicit in the value function.
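To illustrate how a value-based agent can act without storing a policy, the sketch below reads the action off a table of action values by picking the highest-valued one. The table `q_values`, its numbers, and the function name `greedy_policy` are hypothetical.

```python
def greedy_policy(q_values, state):
    """Implicit policy of a value-based agent: pick the action with the highest value."""
    actions = q_values[state]
    return max(actions, key=actions.get)


# Hypothetical action-value table: q_values[state][action].
q_values = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

print(greedy_policy(q_values, "s0"))  # right
print(greedy_policy(q_values, "s1"))  # left
```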

Planning vs. RL: in planning the model of the environment is known in advance; in RL it is initially unknown.

Planning can perform look-ahead search by querying the model.
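A minimal sketch of such a look-ahead, assuming a known tabular model and some given state-value estimates: each action is scored by its expected reward plus the expected value of the next state. The dictionaries `P`, `R`, `V`, the function name, and the absence of discounting are all illustrative assumptions.

```python
def one_step_lookahead(P, R, V, state, actions):
    """Query the model for each action and score it one step ahead."""
    scores = {}
    for a in actions:
        # Expected value of the next state under the model's transition probabilities.
        expected_next = sum(p * V[s2] for s2, p in P[(state, a)].items())
        scores[a] = R[(state, a)] + expected_next
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical known model and value estimates.
P = {("s0", "left"): {"s1": 1.0}, ("s0", "right"): {"s1": 0.5, "s2": 0.5}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
V = {"s1": 2.0, "s2": 4.0}

best, scores = one_step_lookahead(P, R, V, "s0", ["left", "right"])
print(best)    # right
print(scores)  # {'left': 2.0, 'right': 4.0}
```

No interaction with the environment is needed here: the planner only queries `P` and `R`, which is exactly what is unavailable to an RL agent at the start.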
