🌳 From Decision Trees to Random Forests: Key ML Concepts Explained

Machine learning can feel overwhelming with all its jargon: bias, variance, entropy, random forests, transfer learning, and so on. In this post, we’ll walk through some of the most important concepts, explained simply in the style of common exam questions.

🔍 Diagnostics in Machine Learning

A diagnostic is a test you run to gain insight into what is (or isn’t) working with a learning algorithm.

  • If training error is low but cross-validation error is high → the model has high variance (overfitting).
  • If training error is high compared to the baseline → the model has high bias (underfitting).
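
To make this concrete, here is a minimal sketch of such a diagnostic, assuming scikit-learn and a synthetic regression dataset (neither is prescribed by the post):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Synthetic data purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()
model.fit(X, y)

# Training error: how well the model fits the data it has already seen
train_error = mean_squared_error(y, model.predict(X))

# Cross-validation error: how well it generalizes to held-out folds
cv_scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
cv_error = -cv_scores.mean()

print(f"Training error: {train_error:.2f}")
print(f"CV error:       {cv_error:.2f}")

# Low training error but much higher CV error -> high variance (overfitting)
# High training error relative to the baseline -> high bias (underfitting)
```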

⚖️ Bias vs. Variance

Balancing bias and variance is at the heart of machine learning:

  • High Bias (Underfitting) → Model is too simple.
    Fix by: adding features, using a more complex model, or decreasing regularization.
  • High Variance (Overfitting) → Model is too complex.
    Fix by: collecting more data, using regularization, or simplifying the model.

| Problem | Symptoms | Solutions |
| --- | --- | --- |
| High bias | High training error | More features, more complex model, reduce λ |
| High variance | Low training error but high CV error | More data, increase λ, simplify model |
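
As a rough sketch of the regularization knob, here is how sweeping λ (played here by the alpha parameter of scikit-learn's Ridge, chosen purely for illustration) moves a model between the two regimes:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=20, noise=15.0, random_state=1)

# Small alpha ~ flexible model (variance risk);
# large alpha ~ heavily constrained model (bias risk).
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    cv_err = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"alpha={alpha:>6}: train error {train_err:.1f}, CV error {cv_err:.1f}")
```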

🐱 Decision Trees Basics

Decision trees split the data on the feature that yields the highest information gain, which is computed from entropy:

H(p) = -p log2(p) - (1 - p) log2(1 - p)

Information Gain measures how much a split reduces entropy, where w_left and w_right are the fractions of examples sent to the left and right branches:

IG = H(root) - (w_left · H(left) + w_right · H(right))

Example: if a node holds 10 animals, 6 cats and 4 not-cats, then p = 0.6 and

H = -0.6 log2(0.6) - 0.4 log2(0.4) ≈ 0.971

This measures “impurity.” Splits aim to reduce impurity.
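
Here is a minimal NumPy sketch of these formulas, using the 6-cats example above plus a hypothetical split made up for illustration:

```python
import numpy as np

def entropy(p: float) -> float:
    """Binary entropy H(p) in bits; 0 by convention when p is 0 or 1."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(p_root: float, p_left: float, p_right: float, w_left: float) -> float:
    """IG = H(root) - (w_left * H(left) + w_right * H(right))."""
    w_right = 1 - w_left
    return entropy(p_root) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# The post's example: 10 animals, 6 cats and 4 not-cats at the root
print(entropy(0.6))  # ≈ 0.971 bits

# A hypothetical split: 5 animals go left (4 of them cats), 5 go right (2 cats)
ig = information_gain(p_root=0.6, p_left=4/5, p_right=2/5, w_left=5/10)
print(ig)  # ≈ 0.125 bits of impurity removed by this split
```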

🌲 Random Forests

A random forest is a collection of decision trees, where each tree is built slightly differently so that the trees are not all identical:

  • Bootstrap sampling: Train each tree on data sampled with replacement.
  • Feature randomness: Each tree only sees a random subset of features at each split.

This diversity makes random forests more powerful and less prone to overfitting.
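
For illustration, here is a minimal scikit-learn sketch (with synthetic data, which the post does not specify) showing how both ideas appear as RandomForestClassifier parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    bootstrap=True,        # each tree trains on a bootstrap sample (with replacement)
    max_features="sqrt",   # each split considers only a random subset of features
    random_state=0,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```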
