🌳 From Decision Trees to Random Forests: Key ML Concepts Explained
Machine learning can feel overwhelming with all its jargon — bias, variance, entropy, random forests, transfer learning — the list goes on. In this post, we’ll walk through some of the most important concepts, explained simply in the style of common exam questions.
🔍 Diagnostics in Machine Learning
A diagnostic is a test you run to gain insight into what is (or isn’t) working with a learning algorithm.
- If training error is low but cross-validation error is high → the model has high variance (overfitting).
- If training error is high compared to the baseline → the model has high bias (underfitting).
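Putting those two checks into code takes only a few lines. Here is a minimal sketch assuming a scikit-learn-style workflow; the synthetic dataset and the choice of model are placeholders, not anything from a specific project:

```python
# Minimal diagnostic: compare training error with cross-validation error.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # unpruned tree, prone to overfitting
model.fit(X, y)

train_error = 1 - model.score(X, y)                       # error on the training set
cv_error = 1 - cross_val_score(model, X, y, cv=5).mean()  # error estimated by 5-fold CV

print(f"training error: {train_error:.3f}, CV error: {cv_error:.3f}")
# Low training error but much higher CV error -> high variance (overfitting).
# High training error relative to a baseline  -> high bias (underfitting).
```

An unpruned tree will usually drive the training error to (near) zero while the CV error stays noticeably higher — exactly the high-variance signature described above.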
⚖️ Bias vs. Variance
Balancing bias and variance is at the heart of machine learning:
- High Bias (Underfitting) → Model is too simple.
  Fix by: adding features, using a more complex model, or decreasing regularization.
- High Variance (Overfitting) → Model is too complex.
  Fix by: collecting more data, using regularization, or simplifying the model.
| Problem | Symptoms | Solutions |
|---|---|---|
| High Bias | High training error | More features, more complex model, reduce λ |
| High Variance | Low training error but high CV error | More data, increase λ, simplify model |
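To see this trade-off in action, one option is to sweep the regularization strength λ (called `alpha` in scikit-learn's `Ridge`) on a deliberately flexible model. This is just an illustrative sketch on synthetic data; the model and numbers are assumptions, not from the discussion above:

```python
# Sweep the regularization strength and watch training vs. CV error trade off.
# Synthetic 1-D regression; a degree-12 polynomial is flexible enough to overfit.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=60)

for alpha in [1e-4, 1e-2, 1.0, 100.0]:  # lambda: small = weak, large = strong
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha))
    model.fit(X, y)
    train_mse = np.mean((model.predict(X) - y) ** 2)
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"lambda={alpha:g}: train MSE={train_mse:.3f}, CV MSE={cv_mse:.3f}")

# Very small lambda: low training error, higher CV error -> high variance.
# Very large lambda: both errors high                    -> high bias.
```

The sweet spot usually sits somewhere in the middle, which is why λ is tuned against a validation set rather than the training set.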
🌱 Decision Trees Basics
Decision trees choose splits by picking the feature that gives the highest information gain, measured with entropy. For a node where a fraction p of the examples belong to the positive class:
H(p) = -p log2(p) - (1 - p) log2(1 - p)
Information Gain (w_left and w_right are the fractions of examples sent to each branch):
IG = H(root) - (w_left H(left) + w_right H(right))
Example: If 10 animals = 6 cats + 4 not cats →
H = -0.6 log2(0.6) - 0.4 log2(0.4) ≈ 0.97
This measures “impurity.” Splits aim to reduce impurity.
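Here is the same arithmetic as a small Python sketch. The root node is the 6-cats/4-not-cats example above; the candidate split (say, on ear shape) is made up purely for illustration:

```python
import numpy as np

def entropy(p):
    """Binary entropy H(p) in bits; defined as 0 for pure nodes (p = 0 or 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

h_root = entropy(6 / 10)          # 6 cats out of 10 animals -> ~0.971 bits

# Hypothetical split: 5 animals go left (4 of them cats), 5 go right (2 of them cats).
w_left, w_right = 5 / 10, 5 / 10  # fraction of examples in each branch
h_left, h_right = entropy(4 / 5), entropy(2 / 5)

info_gain = h_root - (w_left * h_left + w_right * h_right)
print(f"H(root) = {h_root:.3f}, information gain = {info_gain:.3f}")  # ~0.971, ~0.125
```

A split that separated the cats perfectly would drive both branch entropies to zero, so the information gain would be the full 0.971 bits.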
🌲 Random Forests
A random forest is a collection of decision trees, where each tree is built slightly differently so that the trees are not all identical:
- Bootstrap sampling: Train each tree on data sampled with replacement.
- Feature randomness: Each tree only sees a random subset of features at each split.
This diversity means the individual trees make different mistakes, so averaging their predictions gives a model that is more accurate and less prone to overfitting than any single tree.
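Both ideas map directly onto parameters of scikit-learn's `RandomForestClassifier`. The sketch below compares a single tree with a 100-tree forest on synthetic data; the dataset and hyperparameter values are illustrative, not tuned:

```python
# The two sources of randomness correspond to the bootstrap and max_features parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # each tree trains on a sample drawn with replacement
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
)

print("single tree CV accuracy:", cross_val_score(tree, X, y, cv=5).mean().round(3))
print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean().round(3))
```

On most datasets the forest's cross-validated accuracy comes out ahead of the single tree, precisely because bagging and feature subsampling decorrelate the trees' errors.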