Difference from other machine learning paradigms:

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters: data is sequential, not i.i.d.
  • Agent’s actions affect the subsequent data it receives.

Basic elements


  • A reward \(R_t\) is a scalar feedback signal
  • Indicates how well agent is doing at step \(t\)

Reward hypothesis: all goals can be described by the maximization of expected cumulative reward (select actions to maximize total future reward)
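The "maximize total future reward" objective can be sketched in a few lines. The discount factor `gamma` is an assumption here (a standard choice in RL, not defined in these notes):

```python
# Sketch: the quantity the reward hypothesis says the agent should
# maximize -- cumulative (here: discounted) future reward.
# gamma is an assumed discount factor, not defined in these notes.

def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by gamma per step."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```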


The history is the sequence of observations, actions, rewards

\[ H_t = O_1, R_1, A_1, …, A_{t-1}, O_t, R_t \]


State is the information used to determine what happens next, so that we don't need to go back through the whole history every time. Formally, state is a function of the history:

\[ S_t = f(H_t) \]
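The choice of \(f\) is up to the agent designer. A small sketch with two hypothetical choices of \(f\): keep the whole history, or keep only the latest experience as a compact summary (names here are illustrative, not from the notes):

```python
# Sketch: state as a function of history, S_t = f(H_t).
# Two hypothetical choices of f:

def full_history_state(history):
    # f = identity: the state grows with t
    return tuple(history)

def last_obs_state(history):
    # f keeps only the most recent (action, observation, reward) entry
    return history[-1]

history = [("a1", "o1", 0.0), ("a2", "o2", 1.0)]
print(last_obs_state(history))  # ('a2', 'o2', 1.0)
```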

Environment state \(S_t^e\)

The environment state is the environment’s private representation. It is not usually visible to the agent.

Agent state \(S_t^a\)

The agent state is the agent’s internal representation that can be used to pick the next action

Markov state

If the state is defined so that it has the Markov property, the future is independent of the past given the present:

\[ \mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, …, S_t] \]
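A minimal sketch of what the Markov property means operationally: the next state is sampled from a distribution that depends only on the current state, never on the path taken to reach it. The transition table is a made-up example:

```python
import random

# Sketch of the Markov property: step() looks only at the current
# state, not at any earlier states. Transition table is illustrative.

P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state, rng):
    # Samples S_{t+1} from P[. | S_t] -- no reference to S_1..S_{t-1}
    states, probs = zip(*P[state].items())
    return rng.choices(states, weights=probs)[0]

rng = random.Random(0)
s = "sunny"
for _ in range(5):
    s = step(s, rng)
print(s)
```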

Full observability

When the agent can directly observe environment state. Formally, this is a Markov decision process (MDP).

Partial observability

Agent state \(\neq\) environment state. Formally, this is a partially observable Markov decision process (POMDP).

Inside an RL agent

  • Policy: agent’s behaviour function, deterministic \(a = \pi(s)\) or stochastic \( \pi(a \mid s) \)
  • Value function: how good is each state and/or action
  • Model: agent’s representation of the environment

    \[ \mathcal{P}^a_{ss'} = \mathbb{P}[S'=s' \mid S=s, A=a], \qquad \mathcal{R}_s^a = \mathbb{E}[R \mid S=s, A=a] \]
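A model like \(\mathcal{P}^a_{ss'}\), \(\mathcal{R}^a_s\) can be sketched as plain lookup tables. States, actions, and numbers below are made-up illustrations:

```python
# Sketch: a tabular environment model matching the notes' notation.
# P[(s, a)] gives transition probabilities P^a_{ss'};
# R[(s, a)] gives the expected immediate reward R^a_s.
# All entries are illustrative, not from the notes.

P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s0": 0.5, "s1": 0.5},
    ("s1", "right"): {"s1": 1.0},
}
R = {
    ("s0", "left"):  0.0,
    ("s0", "right"): 1.0,
    ("s1", "right"): 2.0,
}

print(P[("s0", "right")]["s1"])  # 0.5
print(R[("s1", "right")])        # 2.0
```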

A policy-based agent may have no explicit value function; likewise, a value-based agent may have no explicit policy (the policy is implicit in the value function).

Planning vs. RL: in planning the model is known; in RL it is not.

Planning can do look-ahead search by querying the model.
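A one-step look-ahead sketch, assuming a known tabular model (P, R) plus assumed state values V and discount gamma, none of which come from the notes. The agent evaluates actions by querying the model instead of acting in the real environment:

```python
# Sketch of planning by look-ahead search over a known model.
# Model (P, R), values V, and gamma are all made-up assumptions.

P = {("s0", "left"):  {"s0": 1.0},
     ("s0", "right"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
V = {"s0": 0.0, "s1": 10.0}  # assumed values of the next states
gamma = 0.9

def one_step_lookahead(s, actions):
    # Pick the action maximizing R^a_s + gamma * sum_s' P^a_{ss'} V(s'),
    # computed purely by querying the model -- no real interaction.
    def q(a):
        return R[(s, a)] + gamma * sum(p * V[sp]
                                       for sp, p in P[(s, a)].items())
    return max(actions, key=q)

print(one_step_lookahead("s0", ["left", "right"]))  # right
```

Here "right" wins because its expected look-ahead value is 1 + 0.9 · (0.5 · 0 + 0.5 · 10) = 5.5, versus 0 for "left".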


[1] https://www.youtube.com/watch?v=2pWv7GOvuf0&t=3682s