1. What is Reinforcement Learning (RL)?
- RL is a branch of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.
- Unlike supervised learning, RL does not require labeled input-output pairs; instead, it relies on a reward function to guide learning.
- Analogy: training a puppy—positive feedback for good actions, negative feedback for bad actions.
Applications:
- Robotics (e.g., Mars rover, autonomous helicopter)
- Games (chess, Go, video games)
- Optimization problems (factory processes, trading)
2. Key RL Concepts
States (S)
- Represent all possible situations an agent can be in.
- Examples:
  - Mars rover: positions 1–6
  - Helicopter: position, orientation, velocity
  - Chess: board configuration
Actions (A)
- Choices available to the agent in a given state.
- Examples:
  - Rover: move left or right
  - Helicopter: move the control sticks
  - Chess: legal moves
Rewards (R)
- Feedback received after taking an action.
- Examples:
  - Rover: 100 for the leftmost state, 40 for the rightmost, 0 elsewhere
  - Helicopter: +1 for stable flight, -1000 for a crash
  - Chess: +1 for a win, 0 for a tie, -1 for a loss
Discount Factor (γ)
- Weight given to future rewards relative to immediate ones.
- Values: 0 < γ ≤ 1
- Lower γ → agent prefers immediate rewards
- Higher γ → agent values long-term rewards
- Example: γ = 0.5 (rover), γ = 0.99 (helicopter/chess); see the short sketch below
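To make the effect of γ concrete, here is a tiny sketch (plain Python, assuming only the γ = 0.5 value from the rover example) of how quickly the weight on a future reward shrinks:

```python
# Weight that a discount factor of 0.5 puts on a reward k steps in the future.
gamma = 0.5
print([gamma**k for k in range(5)])  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```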
Return (G)
- Total discounted sum of future rewards: G = R₁ + γ·R₂ + γ²·R₃ + …
- Determines which action sequences are “better” in the long run.
- Example for the Mars rover starting at state 4 (worked through in code below):
  - Move left: G = 12.5
  - Move right: G = 10
- Observation: the return depends on the actions chosen and the starting state.
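As a quick check of the numbers above, a minimal sketch of the return formula, assuming the reward sequences the rover collects when it always moves left or always moves right from state 4:

```python
# Discounted return G = R1 + gamma*R2 + gamma^2*R3 + ... for the Mars rover example.
def discounted_return(rewards, gamma):
    """Sum of rewards[t] * gamma**t over the sequence of rewards received."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

gamma = 0.5
go_left = [0, 0, 0, 100]   # rewards along states 4 -> 3 -> 2 -> 1
go_right = [0, 0, 40]      # rewards along states 4 -> 5 -> 6

print(discounted_return(go_left, gamma))   # 12.5
print(discounted_return(go_right, gamma))  # 10.0
```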
Policy (π)
- Rule that maps states to actions so as to maximize the return.
- Can be deterministic (a fixed action per state) or stochastic (actions chosen probabilistically); a short sketch follows below.
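A minimal sketch of both kinds of policy for the rover; the specific state-to-action mapping and the 0.8/0.2 probabilities are made up purely for illustration:

```python
import random

# Deterministic policy: a fixed action per non-terminal rover state.
deterministic_pi = {2: "left", 3: "left", 4: "left", 5: "right"}

# Stochastic policy: sample an action from a per-state probability distribution.
def stochastic_pi(state):
    return random.choices(["left", "right"], weights=[0.8, 0.2])[0]

print(deterministic_pi[4], stochastic_pi(4))
```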
3. Example: Mars Rover RL Problem
- States: 6 positions
- Actions: left or right
- Rewards: 100 (state 1), 40 (state 6), 0 elsewhere
- Discount factor: γ = 0.5
- Return calculation: weighted sum of future rewards
- Goal: choose actions to maximize the cumulative discounted reward (a minimal simulation is sketched below)
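Putting the pieces together, a minimal simulation of this rover problem; the state numbering, rewards, and γ follow the example above, while the helper names (step, rollout) are just for illustration:

```python
# 6-state Mars rover: states 1 and 6 are terminal, paying 100 and 40; all other states pay 0.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}

def step(state, action):
    """Deterministic transition: 'left' moves toward state 1, 'right' toward state 6."""
    return state - 1 if action == "left" else state + 1

def rollout(start, policy, gamma=0.5):
    """Follow the policy until a terminal state and accumulate the discounted return."""
    state, total, discount = start, 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:
            return total
        state = step(state, policy[state])
        discount *= gamma

always_left = {2: "left", 3: "left", 4: "left", 5: "left"}
print(rollout(4, always_left))  # 12.5
```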
4. Generalization to Other Applications
Autonomous Helicopter
- States: position, orientation, speed
- Actions: move the control sticks
- Rewards: +1 for stable flight, -1000 for a crash
- Policy: chooses an action based on the helicopter's current state
Game Playing (Chess)
- States: board configuration
- Actions: legal moves
- Rewards: +1 for a win, 0 for a tie, -1 for a loss
- Policy: selects the best move for each board position
5. Markov Decision Process (MDP)
- Formalism for RL problems.
- Markov property: the future depends only on the current state, not on the past history.
- Components: states, actions, rewards, transition dynamics, policy (a minimal encoding is sketched below).
- Representation: a diagram showing states, actions, resulting states, and rewards.
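One common way to write a small MDP down explicitly is a table from (state, action) pairs to possible outcomes. This is only a hedged sketch: the 0.9/0.1 misstep probabilities are invented for illustration (the rover above is actually deterministic).

```python
# MDP transition dynamics as a dict: (state, action) -> list of (probability, next_state, reward).
P = {
    (4, "left"):  [(0.9, 3, 0), (0.1, 5, 0)],
    (4, "right"): [(0.9, 5, 0), (0.1, 3, 0)],
    # ... one entry per (state, action) pair
}

# The Markov property is built into this representation: the outcome distribution
# depends only on the current (state, action), not on how the state was reached.
```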
6. State-Value (V) and State-Action Value (Q) Functions
- V(s): how good it is to be in state s.
- Q(s,a): how good it is to take action a in state s and then behave optimally afterwards.
- Optimal action: the action with the highest Q(s,a).
Example Q-values:
| Action | Q(s,a) |
|---|---|
| Left | -10 |
| Right | -20 |
| Stop | 0 |
- Optimal action = Stop (highest Q-value); the same argmax step is sketched in code below.
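Picking the optimal action from a table like this is just an argmax over the Q-values; a one-line sketch using the numbers above:

```python
# Greedy action selection from the Q-values in the table above.
q = {"Left": -10, "Right": -20, "Stop": 0}
print(max(q, key=q.get))  # Stop
```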
7. Bellman Equation and Learning Q-values
- The Bellman equation allows the creation of supervised-learning-like targets: Q(s,a) = R(s) + γ · max over a′ of Q(s′,a′), where s′ is the state reached after taking action a in state s.
- Enables Q-learning and deep Q-learning (DQN); a tabular sketch follows below.
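A tabular sketch of using the Bellman equation as an update rule on the rover example. The sweep count and helper names are arbitrary choices, the rewards and transitions match the 6-state example above, and this is value iteration over Q rather than full Q-learning with exploration:

```python
# Tabular Q-value iteration: repeatedly apply Q(s,a) = R(s) + gamma * max_a' Q(s',a').
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}
ACTIONS = ("left", "right")
GAMMA = 0.5

def next_state(s, a):
    return s - 1 if a == "left" else s + 1

Q = {(s, a): 0.0 for s in range(2, 6) for a in ACTIONS}

def value(s):
    """Best achievable return from s: just R(s) at terminals, max_a Q(s,a) otherwise."""
    return REWARDS[s] if s in TERMINAL else max(Q[(s, a)] for a in ACTIONS)

for _ in range(50):  # sweep the Bellman update until the values stop changing
    for s in range(2, 6):
        for a in ACTIONS:
            Q[(s, a)] = REWARDS[s] + GAMMA * value(next_state(s, a))

print(Q[(4, "left")], Q[(4, "right")])  # 12.5 10.0
```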
8. Continuous vs. Discrete MDPs
- Discrete: a finite number of states and actions (Mars rover example).
- Continuous: state variables can take infinitely many values (Lunar Lander, helicopter); see the sketch below.
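One quick way to see the distinction in practice, assuming an installed copy of OpenAI Gym (or Gymnasium) that provides these standard environment IDs:

```python
import gym  # or: import gymnasium as gym

discrete_env = gym.make("FrozenLake-v1")   # finite set of grid-cell states
continuous_env = gym.make("CartPole-v1")   # state is a vector of real numbers

print(discrete_env.observation_space)      # e.g. Discrete(16)
print(continuous_env.observation_space)    # e.g. Box(..., (4,), float32)
```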
9. Practical Tips
- Use environments like OpenAI Gym for experimentation (a minimal interaction loop is sketched below).
- Calculate returns carefully with discount factors.
- Understand Q-values and Bellman updates before moving on to RL algorithms.
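A minimal random-agent loop in OpenAI Gym, sketched under the assumption of the classic API (newer Gym/Gymnasium versions return (obs, info) from reset and a 5-tuple from step):

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()  # newer versions: obs, info = env.reset()

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # random policy, just to exercise the loop
    obs, reward, done, info = env.step(action)   # newer versions also return a `truncated` flag
    total_reward += reward

print("undiscounted episode return:", total_reward)
env.close()
```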
✅ Conclusion
- RL is about learning to act optimally through experience and rewards.
- Key concepts: states, actions, rewards, discount factor, return, policy, Q-values, Bellman equation.
- Hands-on experimentation strengthens intuition and skill.