One hot encoding of string categorical features

If you are on sklearn>0.20.dev0:

    In [11]: from sklearn.preprocessing import OneHotEncoder
        ...: cat = OneHotEncoder()
        ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
        ...: cat.fit_transform(X).toarray()
    Out[11]:
    array([[1., 0., 0., 1., 0.],
           [0., 1., 0., 0., 1.],
           [1., 0., 0., 1., 0.],
           [0., 0., 1., 0., 1.]])

If you are on … Read more
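A self-contained sketch of the same encoding on a recent scikit-learn (the data matches the session above; `categories_` is the encoder's fitted per-column category list):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two object-dtype columns: one string-valued, one integer-valued
X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

enc = OneHotEncoder()              # handles string categories on sklearn >= 0.20
encoded = enc.fit_transform(X).toarray()

print(enc.categories_)             # per-column categories: ['a', 'b', 'c'] and [0, 1]
print(encoded.shape)               # (4, 5): 3 + 2 one-hot columns
```

Each row concatenates the one-hot encodings of both columns, so row 0 (`'a'`, `0`) becomes `[1, 0, 0, 1, 0]`, matching `Out[11]` above.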

Understanding max_features parameter in RandomForestRegressor

Straight from the documentation: [max_features] is the size of the random subsets of features to consider when splitting a node. So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the "random forest" is actually a bagged ensemble of ordinary regression trees. … Read more
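A minimal sketch of the distinction (the dataset and seeds here are illustrative). With m = p every split sees all features, which reduces the forest to plain bagging; a float `max_features` gives the fraction of features drawn per split:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# m = p: every split considers all 10 features -> bagged ordinary regression trees
bagged = RandomForestRegressor(max_features=1.0, random_state=0).fit(X, y)

# m = p/3 (a common heuristic for regression): ~3 random features per split
rf = RandomForestRegressor(max_features=1/3, random_state=0).fit(X, y)

print(bagged.score(X, y), rf.score(X, y))
```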

Feature selection using scikit-learn

The error message Input X must be non-negative says it all: Pearson's chi-squared test (goodness of fit) does not apply to negative values. This is logical, because the chi-squared test assumes a frequency distribution, and a frequency cannot be a negative number. Consequently, sklearn.feature_selection.chi2 asserts that its input is non-negative. You are saying that your features … Read more
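One common way around the constraint is to rescale the features into a non-negative range first; a sketch using MinMaxScaler (the random data here is illustrative, not from the original question):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 5)              # contains negative values -> chi2 would raise
y = rng.randint(0, 2, 100)

# Rescale each feature to [0, 1] so the non-negativity assumption holds
X_pos = MinMaxScaler().fit_transform(X)

X_new = SelectKBest(chi2, k=2).fit_transform(X_pos, y)
print(X_new.shape)                 # (100, 2)
```

Whether rescaling is statistically appropriate depends on whether the features can be read as frequency-like quantities; chi2 is really intended for counts.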

In the LinearRegression method in sklearn, what exactly is the fit_intercept parameter doing? [closed]

fit_intercept=False sets the y-intercept to 0. If fit_intercept=True, the y-intercept will be determined by the line of best fit.

    from sklearn.linear_model import LinearRegression
    from sklearn.datasets import make_regression
    import numpy as np
    import matplotlib.pyplot as plt

    bias = 100
    X = np.arange(1000).reshape(-1, 1)
    y_true = np.ravel(X.dot(0.3) + bias)
    noise = np.random.normal(0, 60, 1000)
    y = y_true + … Read more
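The effect can be seen directly on the fitted attributes; a sketch along the lines of the snippet above (seeded here so the result is reproducible):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
bias = 100
X = np.arange(1000).reshape(-1, 1)
y = np.ravel(X.dot(0.3) + bias) + rng.normal(0, 60, 1000)

with_intercept = LinearRegression(fit_intercept=True).fit(X, y)
without_intercept = LinearRegression(fit_intercept=False).fit(X, y)

print(with_intercept.intercept_)      # estimated from the data, close to the true bias of 100
print(without_intercept.intercept_)   # forced to 0.0; the line must pass through the origin
```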

How can I capture return value with Python timeit module?

For Python 3.5 you can override the value of timeit.template:

    timeit.template = """
    def inner(_it, _timer{init}):
        {setup}
        _t0 = _timer()
        for _i in _it:
            retval = {stmt}
        _t1 = _timer()
        return _t1 - _t0, retval
    """

unutbu's answer works for Python 3.4 but not 3.5, as the _template_func function appears to have been removed in … Read more
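With the template replaced, timeit.timeit returns a (elapsed, retval) tuple instead of just the elapsed time; a sketch (note this relies on the statement being a single expression, since it is substituted into `retval = {stmt}`):

```python
import timeit

# Override the code template that timeit compiles, so the timed
# statement's value is captured and returned alongside the time.
timeit.template = """
def inner(_it, _timer{init}):
    {setup}
    _t0 = _timer()
    for _i in _it:
        retval = {stmt}
    _t1 = _timer()
    return _t1 - _t0, retval
"""

elapsed, result = timeit.timeit('sum(range(100))', number=1000)
print(elapsed, result)   # result is 4950, the value of the last run
```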

Using statsmodel estimations with scikit-learn cross validation, is it possible?

Indeed, you cannot use cross_val_score directly on statsmodels objects, because of their different interface: in statsmodels, training data is passed directly into the constructor, and a separate object contains the result of model estimation. However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, … Read more

No module named ‘sklearn.datasets.samples_generator’

In the latest versions of scikit-learn, there is no module sklearn.datasets.samples_generator – it has been replaced with sklearn.datasets (see the docs); so, according to the make_blobs documentation, your import should simply be:

    from sklearn.datasets import make_blobs

As a general rule, the official documentation is your best friend, and you should definitely consult it first before … Read more
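A quick check that the corrected import works (the sample counts and seed here are illustrative):

```python
from sklearn.datasets import make_blobs

# make_blobs now lives directly under sklearn.datasets
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)
print(X.shape, y.shape)   # (100, 2) (100,)
```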