One hot encoding of string categorical features

If you are on sklearn>0.20.dev0:

    In [11]: from sklearn.preprocessing import OneHotEncoder
        ...: cat = OneHotEncoder()
        ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
        ...: cat.fit_transform(X).toarray()
    Out[11]:
    array([[1., 0., 0., 1., 0.],
           [0., 1., 0., 0., 1.],
           [1., 0., 0., 1., 0.],
           [0., 0., 1., 0., 1.]])

If you are on … Read more
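A self-contained sketch of the same encoding on a recent scikit-learn (the data matches the session above; `categories_` is the encoder's fitted per-column category list):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two object-dtype columns: one string-valued, one integer-valued
X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

enc = OneHotEncoder()              # handles string categories on sklearn >= 0.20
encoded = enc.fit_transform(X).toarray()

print(enc.categories_)             # per-column categories: ['a', 'b', 'c'] and [0, 1]
print(encoded.shape)               # (4, 5): 3 + 2 one-hot columns
```

Each row concatenates the one-hot encodings of both columns, so row 0 (`'a'`, `0`) becomes `[1, 0, 0, 1, 0]`, matching `Out[11]` above.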

Understanding max_features parameter in RandomForestRegressor

Straight from the documentation: [max_features] is the size of the random subsets of features to consider when splitting a node. So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the "random forest" is actually a bagged ensemble of ordinary regression trees. … Read more
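A minimal sketch of the distinction (the dataset and seeds here are illustrative). With m = p every split sees all features, which reduces the forest to plain bagging; a float `max_features` gives the fraction of features drawn per split:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# m = p: every split considers all 10 features -> bagged ordinary regression trees
bagged = RandomForestRegressor(max_features=1.0, random_state=0).fit(X, y)

# m = p/3 (a common heuristic for regression): ~3 random features per split
rf = RandomForestRegressor(max_features=1/3, random_state=0).fit(X, y)

print(bagged.score(X, y), rf.score(X, y))
```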

Feature selection using scikit-learn

The error message Input X must be non-negative says it all: Pearson's chi-squared test (goodness of fit) does not apply to negative values. This is logical, because the chi-squared test assumes a frequency distribution, and a frequency cannot be a negative number. Consequently, sklearn.feature_selection.chi2 asserts that its input is non-negative. You are saying that your features … Read more
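One common way around the constraint is to rescale the features into a non-negative range first; a sketch using MinMaxScaler (the random data here is illustrative, not from the original question):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 5)              # contains negative values -> chi2 would raise
y = rng.randint(0, 2, 100)

# Rescale each feature to [0, 1] so the non-negativity assumption holds
X_pos = MinMaxScaler().fit_transform(X)

X_new = SelectKBest(chi2, k=2).fit_transform(X_pos, y)
print(X_new.shape)                 # (100, 2)
```

Whether rescaling is statistically appropriate depends on whether the features can be read as frequency-like quantities; chi2 is really intended for counts.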

In the LinearRegression method in sklearn, what exactly is the fit_intercept parameter doing? [closed]

fit_intercept=False sets the y-intercept to 0. If fit_intercept=True, the y-intercept will be determined by the line of best fit.

    from sklearn.linear_model import LinearRegression
    from sklearn.datasets import make_regression
    import numpy as np
    import matplotlib.pyplot as plt

    bias = 100
    X = np.arange(1000).reshape(-1, 1)
    y_true = np.ravel(X.dot(0.3) + bias)
    noise = np.random.normal(0, 60, 1000)
    y = y_true + … Read more
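The effect can be seen directly on the fitted attributes; a sketch along the lines of the snippet above (seeded here so the result is reproducible):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
bias = 100
X = np.arange(1000).reshape(-1, 1)
y = np.ravel(X.dot(0.3) + bias) + rng.normal(0, 60, 1000)

with_intercept = LinearRegression(fit_intercept=True).fit(X, y)
without_intercept = LinearRegression(fit_intercept=False).fit(X, y)

print(with_intercept.intercept_)      # estimated from the data, close to the true bias of 100
print(without_intercept.intercept_)   # forced to 0.0; the line must pass through the origin
```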

How can I capture return value with Python timeit module?

For Python 3.5 you can override the value of timeit.template:

    timeit.template = """
    def inner(_it, _timer{init}):
        {setup}
        _t0 = _timer()
        for _i in _it:
            retval = {stmt}
        _t1 = _timer()
        return _t1 - _t0, retval
    """

unutbu's answer works for Python 3.4 but not 3.5, as the _template_func function appears to have been removed in … Read more
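With the template replaced, timeit.timeit returns a (elapsed, retval) tuple instead of just the elapsed time; a sketch (note this relies on the statement being a single expression, since it is substituted into `retval = {stmt}`):

```python
import timeit

# Override the code template that timeit compiles, so the timed
# statement's value is captured and returned alongside the time.
timeit.template = """
def inner(_it, _timer{init}):
    {setup}
    _t0 = _timer()
    for _i in _it:
        retval = {stmt}
    _t1 = _timer()
    return _t1 - _t0, retval
"""

elapsed, result = timeit.timeit('sum(range(100))', number=1000)
print(elapsed, result)   # result is 4950, the value of the last run
```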

Using statsmodel estimations with scikit-learn cross validation, is it possible?

Indeed, you cannot use cross_val_score directly on statsmodels objects, because of their different interface: in statsmodels, training data is passed directly into the constructor, and a separate object contains the result of model estimation. However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, … Read more

No module named ‘sklearn.datasets.samples_generator’

In the latest versions of scikit-learn, there is no module sklearn.datasets.samples_generator – it has been replaced with sklearn.datasets (see the docs); so, according to the make_blobs documentation, your import should simply be:

    from sklearn.datasets import make_blobs

As a general rule, the official documentation is your best friend, and you should definitely consult it first before … Read more
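A quick check that the corrected import works (the sample counts and seed here are illustrative):

```python
from sklearn.datasets import make_blobs

# make_blobs now lives directly under sklearn.datasets
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)
print(X.shape, y.shape)   # (100, 2) (100,)
```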