
Sklearn pipelines, extra classifiers, bug fixes

Patrick Godwin requested to merge sklearn_pipelines into master

I wanted to squeeze in all the sklearn functionality I wanted before the code freeze, so here we are.

Additions:

  • Add 3 new regular classifiers:
    • MultiLayerPerception: Gives us a multilayer neural net while Keras support is being sorted out, or for anyone who wants one without having Keras installed.
    • ApproximateKernelSVM: Uses a linear SVM (backed by liblinear, which scales much better than libsvm, the backend with kernels built in). To get around the model being linear, I've tacked an approximate kernel transformer in between the feature preprocessing (whitening, etc.) and the classifier. This gives us a choice of kernel with some scalability; see the first sketch after this list.
    • ApproximateKernelSGD: Same as the SVM above, except with a generalized linear model trained by stochastic gradient descent. As the number of samples gets very large, this appears to be a good classifier to use in sklearn-land before deep neural nets start giving better results. Could be promising.
  • Add 3 new incremental classifiers. I wanted something more than the Naive Bayes incremental classifier, but most of the candidates don't implement a predict_proba method. The three here are the PassiveAggressive classifier (requested in one of the issues) and incremental versions of the MLP and approximate kernel SGD classifiers; see the second sketch after this list.
  • Swap out the built-in whitener for a choice of whitening from sklearn. There are two choices here: the StandardScaler, which does what you'd expect, and the RobustScaler, which is insensitive to outliers and tends to produce a better result overall. Both also have a partial_fit method and are suitable for incremental use, which was another major thing I wanted.
  • Attach an imputer, if a kwarg is set, to fill in missing values with the mean/median of that particular column (also shown in the first sketch below).
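
To make the ApproximateKernel* classifiers concrete, here is a minimal sketch of what such a pipeline looks like in plain sklearn. The step names, the Nystroem map, and the median-imputation strategy are illustrative choices of mine, not necessarily what the code in this MR uses:

```python
# A minimal sketch of the approximate-kernel idea -- not this MR's actual
# code. An approximate kernel map sits between the (imputing + whitening)
# preprocessing and a liblinear-backed linear SVM, and the whole chain is
# a single sklearn Pipeline.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),  # the optional imputer kwarg
        ("whiten", RobustScaler()),                    # outlier-insensitive whitening
        ("kernel", Nystroem(kernel="rbf", n_components=100)),  # approximate kernel map
        ("svm", LinearSVC()),                          # scales much better than kernelized SVC
    ],
    memory=None,  # point at a directory to cache fitted transformer steps
)
pipeline.fit(X, y)                      # trains the whole chain in one go
scores = pipeline.decision_function(X)  # raw scores; see the rank scaler sketch below
```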
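
And here is a sketch of the incremental side, again illustrative rather than this MR's code: both the scaler and the linear model are updated one batch at a time via partial_fit.

```python
# Incremental training with partial_fit, feeding the data in chunks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
classes = np.unique(y)  # partial_fit needs the full class list up front

scaler = StandardScaler()
clf = SGDClassifier(loss="log_loss")  # log loss => predict_proba is available
                                      # ("log" in older sklearn releases)

for start in range(0, len(X), 100):  # chunks of 100 samples
    X_batch, y_batch = X[start:start + 100], y[start:start + 100]
    scaler.partial_fit(X_batch)
    clf.partial_fit(scaler.transform(X_batch), y_batch, classes=classes)

probabilities = clf.predict_proba(scaler.transform(X))
```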

Changes:

  • Wrap classifiers in sklearn's Pipeline object. This lets me train the entire classifier pipeline in one go, handles caching, and so on. With this, I can build more complicated classifiers with ease. The approximate kernel SVM is not what I'd call 'complicated', but holding the individual pieces separately already gets a bit unwieldy.
  • For some of the new classifiers, hold a rank_scaler object in the model that maps the output of classifiers exposing a decision_function instead of predict_proba into the range [0, 1], so it's friendly for use in calibration maps; see the sketch after this list.
  • Make the .ini file take param_grid and param_distributions dictionaries in a friendlier form, akin to how bounds are set (see the example after this list).
  • Pass in an extra parameter in param_grid to set the number of samples for each hyperparameter individually. Previously this was wrongly controlled by the single num_samples setting rather than being tunable per hyperparameter.
  • Remove the default kwarg in these classifiers; it was never a required option.
  • In features.py, define a DEFAULT_COLUMN_VALUE and feed that in as the Select default rather than having it baked in there. In that same Select option, add a FIXME to rename the variables foo and tmp, whose purpose I don't actually know.
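
For the rank scaler, the idea is roughly the following. This is a hedged sketch of mine (the class name and implementation are illustrative; the actual rank_scaler in this MR may differ): decision_function scores get mapped into [0, 1] by their empirical rank among the scores seen at fit time.

```python
import numpy as np

class RankScaler(object):
    """Illustrative rank scaler: maps decision_function scores into
    [0, 1] via their empirical rank among the fitted scores."""

    def fit(self, scores):
        # remember the sorted training scores
        self.sorted_scores_ = np.sort(np.asarray(scores))
        return self

    def transform(self, scores):
        # fraction of training scores <= each new score, i.e. a value in [0, 1]
        ranks = np.searchsorted(self.sorted_scores_, np.asarray(scores), side="right")
        return ranks / float(len(self.sorted_scores_))

# usage with the pipeline from the first sketch:
#   rank_scaler = RankScaler().fit(pipeline.decision_function(X))
#   calibration_input = rank_scaler.transform(pipeline.decision_function(X_new))
```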
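
For reference, the .ini values ultimately stand in for sklearn-style search dictionaries like the ones below. This is an illustrative example of the stock sklearn objects, not the parser's actual output; note that stock sklearn only exposes a single global n_iter, so the per-hyperparameter sample count above is this MR's own extension.

```python
from scipy.stats import loguniform  # scipy >= 1.4
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

clf = SGDClassifier()

# param_grid: explicit values per hyperparameter, searched exhaustively
grid_search = GridSearchCV(clf, param_grid={"alpha": [1e-4, 1e-3, 1e-2]}, cv=3)

# param_distributions: a distribution per hyperparameter, sampled n_iter times
random_search = RandomizedSearchCV(
    clf, param_distributions={"alpha": loguniform(1e-5, 1e-1)}, n_iter=20, cv=3
)
```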

Bug Fixes:

  • Individual hyperparameters set in the .ini file without using cross-validation weren't actually being applied.
  • features_importance() was not working because it referenced an estimator variable that didn't exist.
