🌳 From Decision Trees to Random Forests: Key ML Concepts Explained
Machine learning can feel overwhelming with all its jargon: bias, variance, entropy, random forests, transfer learning, and the list goes on. In this post, we’ll walk through some of the most important concepts, explained simply in the style of common exam questions.
🔍 Diagnostics in Machine Learning
A diagnostic is a test you run to gain insight into what is (or isn’t) working with a learning algorithm.
- If training error is low but cross-validation error is high → the model has high variance (overfitting).
- If training error is high compared to the baseline → the model has high bias (underfitting).
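To make this concrete, here is a minimal sketch of the diagnostic, assuming scikit-learn is installed and using a synthetic dataset (both are illustrative choices):

```python
# Compare training error with cross-validation error to spot
# high variance (overfitting) or high bias (underfitting).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# An unpruned tree is deliberately flexible, so it tends toward high variance.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

train_error = 1 - model.score(X, y)                       # error on data it has seen
cv_error = 1 - cross_val_score(model, X, y, cv=5).mean()  # error on held-out folds

print(f"Training error: {train_error:.3f}")  # near 0 for an unpruned tree
print(f"CV error:       {cv_error:.3f}")     # noticeably higher -> high variance
```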
⚖️ Bias vs. Variance
Balancing bias and variance is at the heart of machine learning:
- High Bias (Underfitting) → Model is too simple.
  Fix by: adding features, using a more complex model, or decreasing regularization.
- High Variance (Overfitting) → Model is too complex.
  Fix by: collecting more data, using regularization, or simplifying the model.
| Problem | Symptoms | Solutions |
|---|---|---|
| High Bias | High training error | More features, more complex model, reduce λ |
| High Variance | Low training error but high CV error | More data, increase λ, simplify model |
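Here is a toy sketch of the regularization knob λ in action; λ corresponds to `alpha` in scikit-learn’s `Ridge`, and the data and degree-12 features are made up purely to exaggerate the bias/variance trade-off:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

# Sweep lambda from "barely regularized" to "heavily regularized".
for alpha in [1e-3, 1.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(degree=12),
                          StandardScaler(),
                          Ridge(alpha=alpha))
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"lambda={alpha:>6}: CV MSE = {cv_mse:.3f}")
```

Too little regularization overfits (high variance), too much underfits (high bias); the sweet spot sits in between.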
🌱 Decision Trees Basics
Decision trees split the data on the feature that yields the highest information gain, measured using entropy.
Entropy formula:
H(p) = -p log2(p) - (1 - p) log2(1 - p)
Information Gain:
IG = H(root) - (w_left H(left) + w_right H(right))
where w_left and w_right are the fractions of the root’s examples sent to the left and right branches.
Example: If a node contains 10 animals, 6 cats and 4 not-cats, then p = 0.6 and
H = -0.6 log2(0.6) - 0.4 log2(0.4) ≈ 0.971
This measures “impurity.” Splits aim to reduce impurity.
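The two formulas translate directly into a few lines of Python; this is a teaching sketch in plain NumPy, not a production splitter:

```python
import numpy as np

def entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(p_root, p_left, p_right, w_left):
    """IG = H(root) - (w_left * H(left) + w_right * H(right))."""
    w_right = 1 - w_left
    return entropy(p_root) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# The cats example: p = 0.6 gives H ≈ 0.971 bits.
print(entropy(0.6))

# A hypothetical split: 5 animals go left (4 cats), 5 go right (2 cats).
print(information_gain(p_root=0.6, p_left=0.8, p_right=0.4, w_left=0.5))
```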
🌲 Random Forests
A random forest is an ensemble of decision trees, where each tree is built slightly differently so the trees don’t all come out identical:
- Bootstrap sampling: Train each tree on data sampled with replacement.
- Feature randomness: Each tree only sees a random subset of features at each split.
This diversity makes random forests more powerful and less prone to overfitting.
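A minimal random-forest sketch with scikit-learn follows; the dataset is synthetic and the hyperparameter values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # each tree sees a bootstrap sample (with replacement)
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(f"Test accuracy: {forest.score(X_test, y_test):.3f}")
```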
🖼️ Structured vs. Unstructured Data
- Structured data (tables, numbers, categories) → decision trees, gradient boosting, logistic regression often work best.
- Unstructured data (images, audio, text) → neural networks perform better.
Example: A 100x100 image (10,000 pixels) is better handled by a CNN than a decision tree.
🧩 Other Useful Concepts
- Error Analysis: Manually inspect misclassified examples to spot recurring problems (e.g., mislabeled data).
- Data Augmentation: Create new training data by rotating/flipping images or adding noise (see the first sketch after this list).
- Transfer Learning: Start with a pre-trained model and either:
  - Train only the output layers (freeze the earlier layers), or
  - Fine-tune all layers on your dataset.
  (See the second sketch after this list.)
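First, a tiny NumPy augmentation sketch; the 100x100 “image” here is random noise standing in for a real grayscale image:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(size=(100, 100))

flipped = np.fliplr(image)                                # horizontal flip
rotated = np.rot90(image)                                 # 90-degree rotation
noisy = image + rng.normal(scale=0.05, size=image.shape)  # additive noise

augmented = np.stack([image, flipped, rotated, noisy])
print(augmented.shape)  # (4, 100, 100): one example became four
```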
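Second, a hedged transfer-learning sketch with PyTorch/torchvision (assumed installed); ResNet-18 and the 5-class task are illustrative choices, not prescriptions:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Option 1: freeze every pre-trained layer...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the output layer; only this new head will train.
model.fc = nn.Linear(model.fc.in_features, 5)
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

# Option 2 (fine-tuning): skip the freeze loop and optimize
# model.parameters() instead, typically with a smaller learning rate.
```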
🎯 Wrap-Up
In this blog, we covered:
- How to diagnose bias vs. variance
- Entropy and information gain for decision trees
- When to use decision trees vs. neural networks
- The importance of error analysis, data augmentation, and transfer learning
These building blocks are the foundation of many real-world machine learning systems. If you master them, you’ll build strong intuition for diagnosing and improving ML models.