🌳 From Decision Trees to Random Forests: Key ML Concepts Explained

Machine learning can feel overwhelming with all its jargon — bias, variance, entropy, random forests, transfer learning — the list goes on. In this post, we’ll walk through some of the most important concepts, explained simply in the style of common exam questions.

๐Ÿ” Diagnostics in Machine Learning

A diagnostic is a test you run to gain insight into what is (or isn’t) working with a learning algorithm.

  • If training error is low but cross-validation error is high → the model has high variance (overfitting).
  • If training error is high compared to the baseline → the model has high bias (underfitting).
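As a rough sketch of how this diagnostic looks in practice (assuming scikit-learn is available; the dataset and model below are just placeholders), you simply compare the two error numbers:

```python
# Bias/variance diagnostic sketch: compare training error with cross-validation error.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

# A fully grown decision tree is a good candidate for overfitting.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)  # error = 1 - accuracy
cv_error = 1 - model.score(X_cv, y_cv)

print(f"training error: {train_error:.2f}, cross-validation error: {cv_error:.2f}")
# Low training error but high CV error -> high variance (overfitting).
# High training error vs. the baseline -> high bias (underfitting).
```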

⚖️ Bias vs. Variance


Balancing bias and variance is at the heart of machine learning:

  • High Bias (Underfitting) → Model is too simple.
    Fix by: adding features, using a more complex model, or decreasing regularization.
  • High Variance (Overfitting) → Model is too complex.
    Fix by: collecting more data, using regularization, or simplifying the model.
| Problem | Symptoms | Solutions |
| --- | --- | --- |
| High Bias | High training error | More features, more complex model, reduce λ |
| High Variance | Low training error but high CV error | More data, increase λ, simplify model |
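To make the table concrete, here is a small sketch (on a made-up 1-D regression problem) of how sweeping the regularization strength λ, called alpha in scikit-learn's Ridge, moves a model between the two regimes:

```python
# Sweep regularization strength: large alpha -> high bias, tiny alpha -> high variance.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy synthetic data: y = sin(x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.5, random_state=0)

for alpha in [100.0, 1.0, 1e-6]:
    # Degree-12 polynomial features make the model flexible enough to overfit.
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    print(f"alpha={alpha:g}  train R^2={model.score(X_train, y_train):.2f}  "
          f"CV R^2={model.score(X_cv, y_cv):.2f}")
```

A very large alpha tends to give poor scores on both sets (high bias), while a near-zero alpha tends to give a large gap between the training and CV scores (high variance).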

๐Ÿฑ Decision Trees Basics


Decision trees split the data on the feature that gives the highest information gain, where the gain is computed from entropy.

H(p) = -p log2(p) - (1 - p) log2(1 - p)

Information Gain:

IG = H(root) - (w_left · H(left) + w_right · H(right))

where w_left and w_right are the fractions of the root's examples that go to the left and right branches.

Example: If 10 animals = 6 cats + 4 not cats →

H = -0.6 log2(0.6) - 0.4 log2(0.4) ≈ 0.97 bits

This measures “impurity.” Splits aim to reduce impurity.
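Here is a small sketch of both formulas in code, reusing the 6-cats / 4-not-cats example; the particular split passed to information_gain is made up purely to exercise the formula:

```python
import numpy as np

def entropy(p):
    """Binary entropy H(p) in bits; H(0) and H(1) are taken to be 0."""
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(p_root, p_left, p_right, w_left, w_right):
    """IG = H(root) - (w_left * H(left) + w_right * H(right))."""
    return entropy(p_root) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# 10 animals: 6 cats and 4 not cats -> H(0.6) is roughly 0.97 bits.
print(entropy(6 / 10))

# Hypothetical split: left branch gets 5 animals (4 cats), right branch gets 5 (2 cats).
print(information_gain(p_root=6 / 10, p_left=4 / 5, p_right=2 / 5,
                       w_left=5 / 10, w_right=5 / 10))
```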

🌲 Random Forests


A random forest is simply a collection of decision trees, but each tree is built slightly differently so that the trees are not identical copies of one another:

  • Bootstrap sampling: Train each tree on data sampled with replacement.
  • Feature randomness: Each tree only sees a random subset of features at each split.

This diversity makes a random forest more robust than any single decision tree and less prone to overfitting.
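A minimal sketch of this with scikit-learn's RandomForestClassifier, where the bootstrap and max_features arguments correspond to the two ideas above (the dataset is synthetic):

```python
# A random forest: many trees, each trained on a bootstrap sample and
# limited to a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    bootstrap=True,       # each tree sees a sample drawn with replacement
    max_features="sqrt",  # features considered at each split
    random_state=0,
)

print(cross_val_score(forest, X, y, cv=5).mean())
```

With max_features="sqrt", each split considers roughly the square root of the total number of features, which is a common default for classification.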
