scikit-learn – Row Coding

What’s the difference between predict_proba and decision_function in scikit-learn?

September 15, 2023 by Tarik

The former, predict_proba, is a method of a (soft) classifier that outputs the probability of the instance being in each of the classes. The latter, decision_function, finds the distance to the separating hyperplane. For example, an SVM classifier finds hyperplanes separating the space into areas associated with classification outcomes. This function, given a point, finds the … Read more
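As a rough illustration of the difference, the sketch below fits a linear SVC on synthetic data and calls both methods. The dataset and parameters are assumptions chosen for the example. Note that for SVC the decision score is proportional to the signed distance from the hyperplane; dividing by the norm of the weight vector gives the geometric distance for a linear kernel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem (illustrative data, not from the original post).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# probability=True is required for SVC to expose predict_proba.
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

scores = clf.decision_function(X[:5])  # one signed score per sample (binary case)
probas = clf.predict_proba(X[:5])      # one column per class; rows sum to 1

# For a linear kernel, score / ||w|| is the signed geometric distance.
distances = scores / np.linalg.norm(clf.coef_)

print(scores)
print(probas)
print(distances)
```

The sign of the score says which side of the hyperplane the point falls on, while predict_proba returns a calibrated estimate per class.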

How to insert Keras model into scikit-learn pipeline?

September 1, 2023 by Tarik

You need to wrap your Keras model as a scikit-learn model first and then proceed as usual. Here’s a quick example (I’ve omitted the imports for brevity). Here is a full blog post with this one and many other examples: Scikit-learn Pipeline Examples # create a function that returns a model, taking as parameters … Read more

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

June 6, 2023 by Tarik

Some quick preliminaries: Let’s say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the “impurity” of the region is measured by quantifying the inhomogeneity, using the probability of the class in that region. Normally, we estimate: Pr(Class=k) = #(examples … Read more
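A minimal sketch of the idea, assuming the Gini criterion: with sample_weight, class probabilities in a node are estimated from summed weights rather than raw counts. The toy data and the helper weighted_gini are hypothetical, written only for this illustration; the result is compared against the root-node impurity that the fitted tree reports.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (illustrative): two samples per class, one sample weighted more heavily.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 1.0, 3.0])

def weighted_gini(labels, weights):
    """Gini impurity with Pr(Class=k) estimated as a share of total weight."""
    classes = np.unique(labels)
    p = np.array([weights[labels == k].sum() for k in classes]) / weights.sum()
    return 1.0 - np.sum(p ** 2)

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y, sample_weight=w)

root_impurity = tree.tree_.impurity[0]  # impurity of the root node
print(root_impurity, weighted_gini(y, w))
```

With equal weights the same helper reduces to the usual count-based estimate.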

A progress bar for scikit-learn?

November 25, 2022 by Tarik

If you initialize the model with verbose=1 before calling fit, you should get some kind of output indicating the progress. For example, sklearn.ensemble.GradientBoostingClassifier(verbose=1) provides progress output that looks like this:

Iter  Train Loss  Remaining Time
1     1.2811      0.71s
2     1.2595      0.58s
3     1.2402      0.50s
4     1.2263      0.46s
5     1.2121      0.43s
6     1.1999      0.41s
7     1.1876 … Read more
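A small sketch of the above, using synthetic data and a low n_estimators (both assumptions made to keep the run short). Fitting prints one progress row per boosting iteration to standard output.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, only for demonstration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# verbose=1 makes fit() print iteration, training loss, and remaining time.
clf = GradientBoostingClassifier(n_estimators=10, verbose=1, random_state=0)
clf.fit(X, y)
```

Not every estimator accepts a verbose parameter, so check the signature of the model you are using.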

RandomForestClassifier vs ExtraTreesClassifier in scikit learn

November 21, 2022 by Tarik

Yes, both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable the bootstrap resampling. In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometimes generalize better than RFs … Read more
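To check these points on your own data, a sketch like the following compares the default bootstrap setting and the total number of tree nodes for both ensembles. The synthetic dataset and the helper total_nodes are assumptions for illustration; the relative sizes will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data, only for demonstration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=20, random_state=0).fit(X, y)

def total_nodes(forest):
    """Sum the node counts of all fitted trees in an ensemble."""
    return sum(est.tree_.node_count for est in forest.estimators_)

# By default RF resamples with bootstrap and ET does not; both expose the parameter.
print("RF bootstrap:", rf.bootstrap, "total nodes:", total_nodes(rf))
print("ET bootstrap:", et.bootstrap, "total nodes:", total_nodes(et))
```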

sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

November 13, 2022 by Tarik

It looks like sklearn requires the data to have shape (number of rows, number of columns). If your data has shape (number of rows,), such as (999,), it does not work. Use numpy.reshape() to change the shape of the array to (999, 1), e.g. data = data.reshape((999, 1)). In my case, that worked.
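A short sketch of the fix, with made-up data following y = 2x + 1. Using reshape(-1, 1) lets NumPy infer the number of rows, so the length does not need to be hard-coded.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative 1-D data with shape (999,).
x = np.arange(999, dtype=float)
y = 2.0 * x + 1.0

# Turn the single feature into a column: shape (999, 1).
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
print(X.shape, model.coef_, model.intercept_)
```

The target y can stay one-dimensional; only the feature array needs the second axis.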

How are feature_importances in RandomForestClassifier determined?

October 26, 2022 by Tarik

There are indeed several ways to get feature “importances”. As is often the case, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in … Read more
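As a quick sketch, the impurity-based importances can be read from the fitted forest's feature_importances_ attribute. The synthetic data is an assumption for illustration. In this example the forest-level values match the average of the per-tree importances, and they sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, only for demonstration.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

importances = forest.feature_importances_  # one value per feature

# Average the (already normalized) importances of the individual trees.
per_tree_mean = np.mean([t.feature_importances_ for t in forest.estimators_], axis=0)

print(importances)
print(per_tree_mean)
```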