1. What is Reinforcement Learning (RL)?
- RL is a branch of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.
- Unlike supervised learning, RL does not require labeled input-output pairs; instead, it relies on a reward function to guide learning.
- Analogy: training a puppy—positive feedback for good actions, negative feedback for bad actions.
Applications:
- Robotics (e.g., Mars rover, autonomous helicopter)
- Games (chess, Go, video games)
- Optimization problems (factory processes, trading)
2. Key RL Concepts
States (S)
- Represent all possible situations an agent can be in.
- Examples:
  - Mars rover: positions 1–6
  - Helicopter: position, orientation, velocity
  - Chess: board configuration
Actions (A)
- Choices available to the agent in a given state.
- Examples:
  - Rover: move left or right
  - Helicopter: move the control sticks
  - Chess: legal moves
Rewards (R)
- Feedback received after taking an action.
- Examples:
  - Rover: 100 for the leftmost state, 40 for the rightmost, 0 elsewhere
  - Helicopter: +1 for stable flight, -1000 for a crash
  - Chess: +1 for a win, 0 for a tie, -1 for a loss
Discount Factor (γ)
- Weight given to future rewards relative to immediate ones.
- Values: 0 < γ ≤ 1
- Lower γ → agent prefers immediate rewards
- Higher γ → agent values long-term rewards
- Example: γ = 0.5 (rover), γ = 0.99 (helicopter/chess); see the short sketch below
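To make the effect of γ concrete, here is a tiny sketch (plain Python, assuming only the γ = 0.5 value from the rover example) of how quickly the weight on a future reward shrinks:

```python
# Weight that a discount factor of 0.5 puts on a reward k steps in the future.
gamma = 0.5
print([gamma**k for k in range(5)])  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```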
Return (G)
- Total discounted sum of future rewards: G = R₁ + γ·R₂ + γ²·R₃ + …
- Determines which action sequences are “better” in the long run.
- Example for the Mars rover starting at state 4 (worked through in code below):
  - Move left: G = 12.5
  - Move right: G = 10
- Observation: the return depends on the actions chosen and the starting state.
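As a quick check of the numbers above, a minimal sketch of the return formula, assuming the reward sequences the rover collects when it always moves left or always moves right from state 4:

```python
# Discounted return G = R1 + gamma*R2 + gamma^2*R3 + ... for the Mars rover example.
def discounted_return(rewards, gamma):
    """Sum of rewards[t] * gamma**t over the sequence of rewards received."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

gamma = 0.5
go_left = [0, 0, 0, 100]   # rewards along states 4 -> 3 -> 2 -> 1
go_right = [0, 0, 40]      # rewards along states 4 -> 5 -> 6

print(discounted_return(go_left, gamma))   # 12.5
print(discounted_return(go_right, gamma))  # 10.0
```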
Policy (π)
- Rule that maps states to actions so as to maximize the return.
- Can be deterministic (a fixed action per state) or stochastic (actions chosen probabilistically); a short sketch follows below.
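A minimal sketch of both kinds of policy for the rover; the specific state-to-action mapping and the 0.8/0.2 probabilities are made up purely for illustration:

```python
import random

# Deterministic policy: a fixed action per non-terminal rover state.
deterministic_pi = {2: "left", 3: "left", 4: "left", 5: "right"}

# Stochastic policy: sample an action from a per-state probability distribution.
def stochastic_pi(state):
    return random.choices(["left", "right"], weights=[0.8, 0.2])[0]

print(deterministic_pi[4], stochastic_pi(4))
```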
3. Example: Mars Rover RL Problem
- States: 6 positions
- Actions: left or right
- Rewards: 100 (state 1), 40 (state 6), 0 elsewhere
- Discount factor: γ = 0.5
- Return calculation: weighted sum of future rewards
- Goal: choose actions to maximize the cumulative discounted reward (a minimal simulation is sketched below)
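Putting the pieces together, a minimal simulation of this rover problem; the state numbering, rewards, and γ follow the example above, while the helper names (step, rollout) are just for illustration:

```python
# 6-state Mars rover: states 1 and 6 are terminal, paying 100 and 40; all other states pay 0.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}

def step(state, action):
    """Deterministic transition: 'left' moves toward state 1, 'right' toward state 6."""
    return state - 1 if action == "left" else state + 1

def rollout(start, policy, gamma=0.5):
    """Follow the policy until a terminal state and accumulate the discounted return."""
    state, total, discount = start, 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:
            return total
        state = step(state, policy[state])
        discount *= gamma

always_left = {2: "left", 3: "left", 4: "left", 5: "left"}
print(rollout(4, always_left))  # 12.5
```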
4. Generalization to Other Applications
Autonomous Helicopter
- States: position, orientation, speed
- Actions: move the control sticks
- Rewards: +1 for stable flight, -1000 for a crash
- Policy: chooses an action based on the helicopter's current state
Game Playing (Chess)
- States: board configuration
- Actions: legal moves
- Rewards: +1 for a win, 0 for a tie, -1 for a loss
- Policy: selects the best move for each board position
5. Markov Decision Process (MDP)
- Formalism for RL problems.
- Markov property: the future depends only on the current state, not on the past history.
- Components: states, actions, rewards, transition dynamics, policy (a minimal encoding is sketched below).
- Representation: a diagram showing states, actions, resulting states, and rewards.
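One common way to write a small MDP down explicitly is a table from (state, action) pairs to possible outcomes. This is only a hedged sketch: the 0.9/0.1 misstep probabilities are invented for illustration (the rover above is actually deterministic).

```python
# MDP transition dynamics as a dict: (state, action) -> list of (probability, next_state, reward).
P = {
    (4, "left"):  [(0.9, 3, 0), (0.1, 5, 0)],
    (4, "right"): [(0.9, 5, 0), (0.1, 3, 0)],
    # ... one entry per (state, action) pair
}

# The Markov property is built into this representation: the outcome distribution
# depends only on the current (state, action), not on how the state was reached.
```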
6. State-Value (V) and State-Action Value (Q) Functions
- V(s): how good it is to be in state s.
- Q(s,a): how good it is to take action a in state s and then behave optimally afterwards.
- Optimal action: the action with the highest Q(s,a).
Example Q-values:
| Action | Q(s,a) |
|---|---|
| Left | -10 |
| Right | -20 |
| Stop | 0 |
- Optimal action = Stop (highest Q-value); the same argmax step is sketched in code below.
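Picking the optimal action from a table like this is just an argmax over the Q-values; a one-line sketch using the numbers above:

```python
# Greedy action selection from the Q-values in the table above.
q = {"Left": -10, "Right": -20, "Stop": 0}
print(max(q, key=q.get))  # Stop
```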
7. Bellman Equation and Learning Q-values
- The Bellman equation allows the creation of supervised-learning-like targets: Q(s,a) = R(s) + γ · max over a′ of Q(s′,a′), where s′ is the state reached after taking action a in state s.
- Enables Q-learning and deep Q-learning (DQN); a tabular sketch follows below.
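A tabular sketch of using the Bellman equation as an update rule on the rover example. The sweep count and helper names are arbitrary choices, the rewards and transitions match the 6-state example above, and this is value iteration over Q rather than full Q-learning with exploration:

```python
# Tabular Q-value iteration: repeatedly apply Q(s,a) = R(s) + gamma * max_a' Q(s',a').
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
ACTIONS = ("left", "right")
GAMMA = 0.5

def next_state(s, a):
    return s - 1 if a == "left" else s + 1

Q = {(s, a): 0.0 for s in range(2, 6) for a in ACTIONS}

def value(s):
    """Best achievable return from s: just R(s) at terminals, max_a Q(s,a) otherwise."""
    return REWARDS[s] if s in TERMINAL else max(Q[(s, a)] for a in ACTIONS)

for _ in range(50):  # sweep the Bellman update until the values stop changing
    for s in range(2, 6):
        for a in ACTIONS:
            Q[(s, a)] = REWARDS[s] + GAMMA * value(next_state(s, a))

print(Q[(4, "left")], Q[(4, "right")])  # 12.5 10.0
```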
8. Continuous vs. Discrete MDPs
- Discrete: a finite number of states and actions (Mars rover example).
- Continuous: state variables can take infinitely many values (Lunar Lander, helicopter); see the sketch below.
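One quick way to see the distinction in practice, assuming an installed copy of OpenAI Gym (or Gymnasium) that provides these standard environment IDs:

```python
import gym  # or: import gymnasium as gym

discrete_env = gym.make("FrozenLake-v1")   # finite set of grid-cell states
continuous_env = gym.make("CartPole-v1")   # state is a vector of real numbers

print(discrete_env.observation_space)      # e.g. Discrete(16)
print(continuous_env.observation_space)    # e.g. Box(..., (4,), float32)
```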
9. Practical Tips
- Use environments like OpenAI Gym for experimentation (a minimal interaction loop is sketched below).
- Calculate returns carefully with discount factors.
- Understand Q-values and Bellman updates before moving on to RL algorithms.
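A minimal random-agent loop in OpenAI Gym, sketched under the assumption of the classic API (newer Gym/Gymnasium versions return (obs, info) from reset and a 5-tuple from step):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()  # newer versions: obs, info = env.reset()

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # random policy, just to exercise the loop
    obs, reward, done, info = env.step(action)   # newer versions also return a `truncated` flag
    total_reward += reward

print("undiscounted episode return:", total_reward)
env.close()
```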
✅ Conclusion
- RL is about learning to act optimally through experience and rewards.
- Key concepts: states, actions, rewards, discount factor, return, policy, Q-values, Bellman equation.
- Hands-on experimentation strengthens intuition and skill.