I started participating in kaggle contests in December 2014. This post contains my notes for things I have learnt so far. It is still a work in progress.

Important things for performing well in competitions:

  1. Set up an environment for rapid iteration and experimentation; use tools like git, sklearn, ggplot2
  2. Feature extraction
  3. Feature engineering
  4. Prevent overfitting
  5. Have patience

General flow:

Extract and select features -> Train models -> Evaluate and visualize results -> Identify and handle data oddities -> Data Preprocessing -> ..REPEAT..

If the dataset is too large, consider splitting it.

Text Learning

Some of the important feature engineering techniques are

  • toLowerCase
  • spell correction
  • stemming
  • unigrams to trigrams

The Porter stemming algorithm is an efficient preprocessing step for many English-language tasks, e.g. connects -> connect, connecting -> connect.
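As a rough illustration, here is a minimal sketch of these steps (lowercasing, Porter stemming, unigram-to-trigram features) using NLTK and scikit-learn; the sample documents and the exact pipeline are placeholders, not code from an actual competition.

```python
# Minimal text preprocessing sketch: lowercase + Porter stemming + n-gram features.
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def preprocess(doc):
    # lowercase the document, then stem each whitespace-separated token
    return " ".join(stemmer.stem(tok) for tok in doc.lower().split())

docs = ["Connecting flights", "He connects the wires"]
docs = [preprocess(d) for d in docs]

# extract unigrams, bigrams and trigrams as count features
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)

# the vocabulary now contains stemmed unigram-to-trigram terms
print(sorted(vectorizer.vocabulary_))
```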

Most useful algorithms

Random Forests, Support Vector Machines, Gradient Boosting Machines -> Blending

SVM, RF, NN, GBM and multiple linear regression, combined via a hierarchical ensemble

RF and GBM perform very well for common classification and regression tasks

Model ensembling usually yields marginal but significant performance gains, typically 1%-5% (it acts as a form of regularisation, and different models provide different ways of looking at the data).
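For illustration, a minimal blending sketch with scikit-learn that just averages predicted probabilities from a few diverse models; the synthetic dataset, model settings and equal weights are assumptions for the example, not a recipe from any particular competition.

```python
# Simple probability-averaging blend of RF, GBM and SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
    SVC(probability=True, random_state=0),  # probability=True enables predict_proba
]

# equal-weight blend: average the class-1 probabilities of all models
blend = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models], axis=0)
accuracy = np.mean((blend > 0.5) == y_te)
print("blended accuracy:", accuracy)
```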

Feature selection

The Boruta feature selection algorithm is robust and reliable

  • A wrapper method around random forests and their calculated variable importance
  • Iteratively trains RFs and runs statistical tests to classify features as important or unimportant
  • Widely used in competition-winning models to select a small subset of features for training more complex models
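A minimal sketch of running Boruta from Python, assuming the boruta_py package (pip install Boruta); the estimator settings and dataset are illustrative only.

```python
# Boruta feature selection wrapped around a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)

# Boruta iteratively compares real features against shuffled "shadow" features
# and keeps only those whose importance beats the shadows significantly
selector = BorutaPy(rf, n_estimators='auto', random_state=0)
selector.fit(X, y)

print("selected features:", np.where(selector.support_)[0])
X_selected = selector.transform(X)  # keep only the confirmed features
```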

Deep Learning in computer vision competitions

Useful libraries:

  • caffe
  • Theano
  • Torch7

Downsampling dataset

PCA

Apply mean normalisation and feature scaling before applying PCA.

Compression: Choose k by % of variance retained (~99%)

Visualization: for k = 2 or 3
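A minimal sketch of scaling followed by PCA with scikit-learn; the digits dataset is just an example, and the 0.99 threshold mirrors the ~99% variance-retained rule of thumb above.

```python
# Scale features, then pick k by retained variance; also project to 2D for plots.
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# mean-normalise and scale each feature before PCA
X_scaled = StandardScaler().fit_transform(X)

# compression: keep the smallest k that retains ~99% of the variance
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X_scaled)
print("k =", pca.n_components_,
      "variance retained =", pca.explained_variance_ratio_.sum())

# visualization: project to 2 components for plotting
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```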

Misuse of PCA:

  • To prevent overfitting: it might appear to work, but it is not a good way to address overfitting; use regularization instead. PCA does not use the labels (y), so it reduces information blindly and is more likely to throw away useful signal
  • Consider PCA only when using the original features doesn't work, e.g. the data is too large and training is slow or disk space is a problem