Kaggle Contests
I started participating in Kaggle contests in December 2014. This post contains my notes on things I have learnt so far. It is still a work in progress.
Important things for performing well in competitions:
- Set up an environment for rapid iteration and experimentation. Use tools like git, sklearn, and ggplot2
- Feature extraction
- Feature engineering
- Prevent overfitting
- Have patience
General flow:
Extract and select features -> Train models -> Evaluate and visualize results -> Identify and handle data oddities -> Preprocess data -> ..REPEAT..
If the dataset is too large, consider splitting it, as in the sketch below.
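As a rough sketch, a large CSV can be processed and sampled chunk by chunk with pandas; the file name, chunk size, and sampling fraction here are placeholders:

```python
# A minimal sketch: process a large CSV in manageable chunks with pandas.
# "train.csv", the chunk size, and the 10% sample rate are hypothetical.
import pandas as pd

chunks = []
for chunk in pd.read_csv("train.csv", chunksize=100000):
    # e.g. downsample each chunk, or compute partial statistics here
    chunks.append(chunk.sample(frac=0.1, random_state=0))

sample = pd.concat(chunks, ignore_index=True)
print(sample.shape)
```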
Text Learning
Some of the important feature engineering techniques are:
- lowercasing
- spell correction
- stemming (e.g. Porter)
- n-grams (unigrams to trigrams)
The Porter stemming algorithm is an efficient preprocessing step for many English-language tasks. E.g. connects -> connect; connecting -> connect.
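Here is a minimal sketch of these steps using NLTK's Porter stemmer and scikit-learn's CountVectorizer; the example documents are made up:

```python
# A minimal sketch combining lowercasing, Porter stemming, and
# unigram-to-trigram features. The example documents are invented.
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
print(stemmer.stem("connecting"))  # -> "connect"

def stemmed_tokens(text):
    # CountVectorizer lowercases in its preprocessor by default;
    # this custom tokenizer adds stemming on top
    return [stemmer.stem(tok) for tok in text.split()]

vectorizer = CountVectorizer(tokenizer=stemmed_tokens, ngram_range=(1, 3))
X = vectorizer.fit_transform(["He connects wires", "She was connecting wires"])
print(X.shape)
```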
Most useful algorithms
Random Forests, Support Vector Machines, and Gradient Boosting Machines -> blending
Hierarchical ensembles of SVM, RF, NN, GBM, and multiple linear regression to combine models
RF and GBM perform very well for common classification and regression tasks
Model ensembling usually results in marginal but significant performance gains (typically 1%-5%): it acts as regularisation, and different models provide different ways of looking at the data.
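A minimal sketch of blending: average the predicted probabilities of a random forest and a GBM on synthetic data. The equal weights are just a starting guess to tune:

```python
# A minimal sketch of blending two probabilistic classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Average the two models' probabilities with equal (guessed) weights
blend = 0.5 * rf.predict_proba(X_te)[:, 1] + 0.5 * gbm.predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, blend))
```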
Feature selection
The Boruta feature selection algorithm is robust and reliable:
- A wrapper method around random forest and its calculated variable importance
- Iteratively trains RFs and runs statistical tests to classify features as important or unimportant
- Widely used in competition-winning models to select a small subset of features for training more complex models
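A minimal sketch using BorutaPy, a Python port of the R Boruta package (assumes the boruta package is installed; the data is synthetic):

```python
# A minimal sketch of Boruta feature selection via the BorutaPy port
# (assumes `pip install Boruta`; data below is synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
selector = BorutaPy(rf, n_estimators='auto', random_state=0)
selector.fit(X, y)  # BorutaPy expects numpy arrays

# Indices of features the statistical tests marked as important
print("selected features:", np.where(selector.support_)[0])
```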
Deep Learning in computer vision competitions
Useful libraries:
- Caffe
- Theano
- Torch7
Downsampling the dataset
PCA
Apply mean normalisation and feature scaling before PCA.
Compression: Choose k by % of variance retained (~99%)
Visualization: for k = 2 or 3
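A minimal sketch of both steps with scikit-learn on synthetic data: standardise the features, then let PCA pick k by the fraction of variance retained:

```python
# A minimal sketch: scale features, then keep enough principal
# components to retain ~99% of the variance. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=1000, n_features=50, random_state=0)

X_scaled = StandardScaler().fit_transform(X)   # mean normalisation + scaling
pca = PCA(n_components=0.99)                   # choose k by variance retained
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```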
Misuse of PCA:
- To prevent overfitting: it might work OK, but it is not a good way to address overfitting; use regularization instead. PCA does not use the labels (y), so it reduces information blindly and is more likely to throw away useful information.
- Consider PCA only if using the original features won't work (e.g. the data is so large that training is slow or disk space is an issue).