What’s the difference between predict_proba and decision_function in scikit-learn?

The former, predict_proba, is a method of a (soft) classifier that outputs the probability of the instance being in each of the classes. The latter, decision_function, finds the distance to the separating hyperplane. For example, an SVM classifier finds hyperplanes separating the space into areas associated with classification outcomes. This function, given a point, finds the … Read more
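
To make the contrast concrete, here is a minimal sketch that calls both methods on a toy two-class problem. The dataset, the choice of a linear-kernel SVC for decision_function, and LogisticRegression for predict_proba are assumptions made for illustration only; they are not taken from the original answer. For a linear model the raw score is w·x + b, so dividing by the norm of the weight vector gives the signed geometric distance to the hyperplane.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy binary problem (illustrative data, not from the original post).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# decision_function: one signed score per sample for a binary problem.
svm = SVC(kernel="linear").fit(X, y)
scores = svm.decision_function(X)               # shape (n_samples,)
distances = scores / np.linalg.norm(svm.coef_)  # signed distance to the hyperplane

# predict_proba: one probability per class, each row sums to 1.
logreg = LogisticRegression().fit(X, y)
proba = logreg.predict_proba(X)                 # shape (n_samples, 2)

print("first score:", scores[0], "first distance:", distances[0])
print("first probability row:", proba[0])
```

The sign of the score tells you which side of the hyperplane the point falls on; the probabilities are bounded in [0, 1] and are directly comparable across classes.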

How to insert Keras model into scikit-learn pipeline?

You need to wrap your Keras model as a scikit-learn model first and then proceed as usual. Here’s a quick example (I’ve omitted the imports for brevity). Here is a full blog post with this one and many other examples: Scikit-learn Pipeline Examples. # create a function that returns a model, taking as parameters … Read more

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

Some quick preliminaries: Let’s say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the “impurity” of the region is measured by quantifying the inhomogeneity, using the probability of the class in that region. Normally, we estimate: Pr(Class=k) = #(examples … Read more
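
As a small sanity check of the idea that weights change the class probabilities used for impurity, the sketch below compares a hand-computed weighted Gini impurity at the root with what a fitted tree reports. The four-point dataset and the weights are made up for illustration, and normalizing tree_.value is a deliberate choice so the comparison does not depend on whether a given scikit-learn version stores weighted counts or fractions there.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative dataset: the last example counts three times as much.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 1.0, 3.0])

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y, sample_weight=w)

# Weighted class probabilities at the root: sum of weights per class / total weight.
p_manual = np.array([w[y == k].sum() for k in (0, 1)]) / w.sum()
gini_manual = 1.0 - np.sum(p_manual ** 2)

# The same quantities as stored by the fitted tree (node 0 is the root).
root_value = clf.tree_.value[0, 0]
p_tree = root_value / root_value.sum()
gini_tree = clf.tree_.impurity[0]

print("manual:", p_manual, gini_manual)
print("tree:  ", p_tree, gini_tree)
```

With unit weights the root would be a 50/50 split with Gini 0.5; with the weights above, class 1 carries 4/6 of the total weight, so the impurity drops.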

A progress bar for scikit-learn?

If you initialize the model with verbose=1 before calling fit, you should get some kind of output indicating the progress. For example, sklearn.ensemble.GradientBoostingClassifier(verbose=1) provides progress output that looks like this: Iter Train Loss Remaining Time 1 1.2811 0.71s 2 1.2595 0.58s 3 1.2402 0.50s 4 1.2263 0.46s 5 1.2121 0.43s 6 1.1999 0.41s 7 1.1876 … Read more
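
A minimal way to try this out, assuming a synthetic dataset and a small number of boosting stages (both chosen here only to keep the run short):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# verbose=1 makes fit() print a progress table to stdout while training.
clf = GradientBoostingClassifier(n_estimators=20, verbose=1, random_state=0)
clf.fit(X, y)

print("fitted stages:", len(clf.estimators_))
```

The exact numbers printed will differ from the excerpt above, since they depend on the data and the machine.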

RandomForestClassifier vs ExtraTreesClassifier in scikit-learn

Yes, both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable the bootstrap resampling. In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometimes generalize better than RFs … Read more
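
If you want to check the size claim on your own data, a rough sketch like the following compares the two estimators. The synthetic dataset and the total_nodes helper are assumptions introduced for illustration; the node counts you get will depend on the data, so treat the printed numbers as an observation rather than a general result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)


def total_nodes(forest):
    """Hypothetical helper: total number of nodes over all trees in the ensemble."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


# bootstrap is a constructor parameter on both classes; the defaults differ.
print("RF bootstrap:", rf.bootstrap, "total nodes:", total_nodes(rf))
print("ET bootstrap:", et.bootstrap, "total nodes:", total_nodes(et))
```

Passing bootstrap=False to RandomForestClassifier, or bootstrap=True to ExtraTreesClassifier, lets you isolate the effect of resampling from the effect of the split-selection strategy.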

How are feature_importances in RandomForestClassifier determined?

There are indeed several ways to get feature “importances”. As often, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in … Read more
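
To see how the forest-level numbers relate to the individual trees, the sketch below reads feature_importances_ from a fitted forest and rebuilds it by averaging the per-tree importances. The synthetic dataset is an assumption for illustration, and the manual averaging is a reconstruction of the behavior as I understand it, not a copy of the library's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Impurity-based importances exposed by the forest: one value per feature.
importances = forest.feature_importances_

# Rebuild them as the average of each tree's own normalized importances,
# skipping any tree that consists of a single node.
per_tree = [
    tree.feature_importances_
    for tree in forest.estimators_
    if tree.tree_.node_count > 1
]
manual = np.mean(per_tree, axis=0)
manual = manual / manual.sum()

for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f}")
```

The values are non-negative and sum to 1, so they are relative scores within one fitted model rather than absolute measures.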