Reinforcement Learning Overview

1. What is Reinforcement Learning (RL)?

  • RL is a branch of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.

  • Unlike supervised learning, RL does not require labeled input-output pairs; instead, it relies on a reward function to guide learning.

  • Analogy: Training a puppy—positive feedback for good actions, negative feedback for bad actions.

Applications:

  • Robotics (e.g., Mars rover, autonomous helicopter)

  • Games (chess, Go, video games)

  • Optimization problems (factory processes, trading)


2. Key RL Concepts

States (S)

  • Represent all possible situations an agent can be in.

  • Examples:

    • Mars rover: positions 1–6

    • Helicopter: position, orientation, velocity

    • Chess: board configuration

Actions (A)

  • Choices available to the agent in a given state.

  • Examples:

    • Rover: move left or right

    • Helicopter: move control sticks

    • Chess: legal moves

Rewards (R)

  • Feedback received after taking an action.

  • Examples:

    • Rover: 100 for leftmost state, 40 for rightmost, 0 elsewhere

    • Helicopter: +1 for stable flight, -1000 for crash

    • Chess: +1 win, 0 tie, -1 loss

Discount Factor (γ)

  • Weight for future rewards relative to immediate ones.

  • Values: 0 < γ ≤ 1

  • Lower γ → agent prefers immediate rewards

  • Higher γ → agent values long-term rewards

  • Example: γ = 0.5 (rover), γ = 0.99 (helicopter/chess)
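To see the effect of γ concretely, the short Python sketch below (illustrative only, not part of the original notes) prints the weight γ^k applied to a reward received k steps in the future for both values of γ:

    # Weight applied to a reward received k steps in the future: gamma ** k
    for gamma in (0.5, 0.99):
        weights = [round(gamma ** k, 4) for k in range(5)]
        print(f"gamma = {gamma}: {weights}")

    # gamma = 0.5  -> [1.0, 0.5, 0.25, 0.125, 0.0625]    (future rewards fade quickly)
    # gamma = 0.99 -> [1.0, 0.99, 0.9801, 0.9703, 0.9606] (future rewards stay valuable)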

Return (G)

  • Total discounted sum of future rewards:

G = R₁ + γR₂ + γ²R₃ + …

  • Determines which action sequences are “better” in the long run.

  • Example for Mars rover at state 4:

    • Move left: G = 12.5

    • Move right: G = 10

  • Observation: The return depends on the actions chosen and on the starting state (see the short calculation sketched below).
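These numbers can be checked with a few lines of Python (a minimal sketch assuming the rover reward layout of 100 in state 1, 40 in state 6, 0 elsewhere, and γ = 0.5):

    gamma = 0.5

    def discounted_return(rewards, gamma):
        # G = R1 + gamma*R2 + gamma^2*R3 + ...
        return sum(r * gamma ** k for k, r in enumerate(rewards))

    # Starting in state 4 and always moving left visits states 4 -> 3 -> 2 -> 1
    print(discounted_return([0, 0, 0, 100], gamma))  # 12.5
    # Starting in state 4 and always moving right visits states 4 -> 5 -> 6
    print(discounted_return([0, 0, 40], gamma))      # 10.0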

Policy (π)

  • Rule that maps states to actions to maximize return.

  • Can be deterministic (fixed action per state) or stochastic (probabilistic).


3. Example: Mars Rover RL Problem

  • States: 6 positions

  • Actions: left or right

  • Rewards: 100 (state 1), 40 (state 6), 0 elsewhere

  • Discount factor: γ = 0.5

  • Return calculation: Weighted sum of future rewards

  • Goal: Choose actions that maximize the discounted cumulative reward (see the rollout sketch below)
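A minimal rollout sketch of this problem (assuming, as is standard for this example, that states 1 and 6 are terminal; the function and variable names here are illustrative):

    GAMMA = 0.5
    REWARD = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # reward for being in each state
    TERMINAL = {1, 6}

    def step(state, action):
        # Deterministic transitions: "left" decreases the position, "right" increases it
        return state - 1 if action == "left" else state + 1

    def rollout_return(start, policy, gamma=GAMMA):
        # Discounted return G from following `policy` until a terminal state is reached
        state, g, discount = start, 0.0, 1.0
        while True:
            g += discount * REWARD[state]
            if state in TERMINAL:
                return g
            state = step(state, policy(state))
            discount *= gamma

    always_left = lambda s: "left"
    print([(s, rollout_return(s, always_left)) for s in range(1, 7)])
    # [(1, 100.0), (2, 50.0), (3, 25.0), (4, 12.5), (5, 6.25), (6, 40.0)]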


4. Generalization to Other Applications

Autonomous Helicopter

  • States: Position, orientation, speed

  • Actions: Move control sticks

  • Rewards: +1 for stable flight, -1000 for crash

  • Policy: Chooses action based on helicopter state

Game Playing (Chess)

  • States: Board configuration

  • Actions: Legal moves

  • Rewards: +1 win, 0 tie, -1 loss

  • Policy: Selects best move for each board position


5. Markov Decision Process (MDP)

  • Formalism for RL problems

  • Markov Property: Future depends only on current state, not past history

  • Components: states, actions, rewards, transition dynamics, and the discount factor; the policy is the solution the agent learns, not part of the MDP itself

  • Representation: Diagram showing states, actions, resulting states, and rewards


6. State-Value (V) and State-Action Value (Q) Functions

  • V(s): Expected return from state s when following a given policy (how good it is to be in state s)

  • Q(s, a): Expected return from starting in state s, taking action a once, and then behaving optimally afterward (how good it is to take action a in state s)

  • Optimal action in state s: the action a with the highest Q(s, a)

Example Q-values:

  • Left: Q(s, Left) = -10

  • Right: Q(s, Right) = -20

  • Stop: Q(s, Stop) = 0

  • Optimal action = Stop (highest Q-value)
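Given such a table, the greedy (optimal) action is simply the argmax over actions; a tiny sketch:

    q = {"Left": -10, "Right": -20, "Stop": 0}
    best_action = max(q, key=q.get)   # action with the highest Q(s, a)
    print(best_action)                # Stop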


7. Bellman Equation and Learning Q-values

  • Bellman equation allows creation of supervised-learning-like targets:

Q(s, a) = R(s, a) + γ max_a′ Q(s′, a′)

  • Here s′ is the next state reached after taking action a in state s, and a′ ranges over the actions available in s′.

  • Enables Q-learning and deep Q-learning (DQN)
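A small tabular sketch of repeatedly applying this update to the Mars rover problem (states 1 and 6 terminal, γ = 0.5; here the reward depends only on the state, so R(s, a) = R(s); names are illustrative):

    GAMMA = 0.5
    STATES = range(1, 7)
    ACTIONS = ("left", "right")
    REWARD = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
    TERMINAL = {1, 6}

    def next_state(s, a):
        return s - 1 if a == "left" else s + 1

    # Q-iteration: sweep over all (s, a) pairs and apply the Bellman update until it converges
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(100):
        for s in STATES:
            for a in ACTIONS:
                if s in TERMINAL:
                    Q[(s, a)] = REWARD[s]  # no future rewards after a terminal state
                else:
                    s2 = next_state(s, a)
                    Q[(s, a)] = REWARD[s] + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)

    print(Q[(4, "left")], Q[(4, "right")])  # 12.5 10.0

Q-learning proper estimates the same quantities from sampled experience rather than from known transition dynamics; DQN replaces the table with a neural network.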


8. Continuous vs. Discrete MDPs

  • Discrete: Finite number of states and actions (Mars rover example)

  • Continuous: State variables can take infinite values (Lunar Lander, helicopter)
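For example, a discrete state can be stored as a single index, while a continuous state is typically a vector of real-valued features (the layout below is purely illustrative):

    import numpy as np

    rover_state = 4  # discrete: one of 6 positions

    # continuous: e.g. x, y, z position plus orientation and velocity components
    helicopter_state = np.array([10.2, -3.5, 25.0, 0.01, -0.02, 1.57])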


9. Practical Tips

  • Use environments like OpenAI Gym for experimentation

  • Calculate returns carefully with discount factors

  • Understand Q-values and Bellman updates for RL algorithms
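A minimal starting point for experimentation, using the Gymnasium fork of OpenAI Gym (the reset/step signatures differ between older gym releases and gymnasium; this sketch assumes the gymnasium API):

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    gamma, discount, G = 0.99, 1.0, 0.0
    done = False
    while not done:
        action = env.action_space.sample()  # random policy, just to explore the API
        obs, reward, terminated, truncated, info = env.step(action)
        G += discount * reward              # accumulate the discounted return
        discount *= gamma
        done = terminated or truncated

    print(f"Discounted return of one random episode: {G:.2f}")
    env.close()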


Conclusion

  • RL is about learning to act optimally through experience and rewards.

  • Key concepts: states, actions, rewards, discount factor, return, policy, Q-values, Bellman equation.

  • Hands-on experimentation strengthens intuition and skill.
