Understanding max_features parameter in RandomForestRegressor

Straight from the documentation: [max_features] is the size of the random subsets of features to consider when splitting a node. So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the “random forest” is actually a bagged ensemble of ordinary regression trees. … Read more
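A minimal sketch of the effect on a synthetic regression problem (the data and parameter values are illustrative, not from the answer). In older scikit-learn versions max_features="auto" meant m = p for the regressor; max_features=None gives the same behaviour explicitly, while "sqrt" or a float turns the per-split feature subsampling back on.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Illustrative data; the point is only to compare max_features settings.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# max_features=None uses all p features at every split (a bagged ensemble of
# ordinary regression trees); "sqrt" or a float draws a random subset per split.
for m in (None, "sqrt", 0.5):
    reg = RandomForestRegressor(n_estimators=200, max_features=m, random_state=0)
    print("max_features=%r: mean R^2 = %.3f" % (m, cross_val_score(reg, X, y, cv=5).mean()))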

Feature selection using scikit-learn

The error message Input X must be non-negative says it all: Pearson’s chi square test (goodness of fit) does not apply to negative values. That is logical, because the chi square test assumes a frequency distribution, and a frequency can’t be negative. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative. You are saying that your features … Read more
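A short sketch of the failure and a common workaround (the toy data and the rescaling step are illustrative, not part of the original answer): chi2 rejects negative values, so features are often rescaled to a non-negative range, e.g. with MinMaxScaler, before selection.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, -0.5], [2.0, 0.3], [0.5, -1.2], [1.5, 0.8]])
y = np.array([0, 1, 0, 1])

try:
    SelectKBest(chi2, k=1).fit(X, y)          # raises because X has negative values
except ValueError as err:
    print(err)

X_scaled = MinMaxScaler().fit_transform(X)    # maps each feature into [0, 1]
selector = SelectKBest(chi2, k=1).fit(X_scaled, y)
print(selector.get_support())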

Correlated features and classification accuracy

Correlated features do not affect classification accuracy per se. The problem in realistic situations is that we have a finite number of training examples with which to train a classifier. For a fixed number of training examples, increasing the number of features typically increases classification accuracy to a point but as the number of features … Read more
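An illustrative sketch of both points on synthetic data (everything below is an assumption for demonstration, not from the answer): a perfectly correlated copy of an existing feature barely changes accuracy, while adding many uninformative features with the same fixed number of training examples can hurt.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=200, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
clf = LogisticRegression(max_iter=1000)

print("baseline:             %.3f" % cross_val_score(clf, X, y, cv=5).mean())

X_corr = np.hstack([X, X[:, [0]]])             # add a perfectly correlated copy
print("with correlated copy: %.3f" % cross_val_score(clf, X_corr, y, cv=5).mean())

X_noise = np.hstack([X, rng.randn(200, 200)])  # add 200 pure-noise features
print("with 200 noise dims:  %.3f" % cross_val_score(clf, X_noise, y, cv=5).mean())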

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don’t set it, you get: >>> v = CountVectorizer(ngram_range=(1, 2)) >>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_) {u'an': 0, u'an apple': 1, u'apple': 2, u'apple day': 3, u'away': 4, u'day': 5, u'day keeps': 6, u'doctor': 7, u'doctor away': 8, u'keeps': … Read more
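A runnable Python 3 version of that snippet (the exact integer indices in the vocabulary may differ between scikit-learn versions, but the unigram-plus-bigram structure is the point):

from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer(ngram_range=(1, 2))          # learn unigrams and bigrams
v.fit(["an apple a day keeps the doctor away"])
pprint(v.vocabulary_)                            # learned term -> column index
# Passing vocabulary=... to the constructor skips this learning step entirely.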

Random Forest Feature Importance Chart using Python

Here is an example using the iris data set. >>> from sklearn.datasets import load_iris >>> from sklearn.ensemble import RandomForestClassifier >>> iris = load_iris() >>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42) >>> rnd_clf.fit(iris["data"], iris["target"]) >>> for name, importance in zip(iris["feature_names"], rnd_clf.feature_importances_): … print(name, "=", importance) sepal length (cm) = 0.112492250999 sepal width (cm) = 0.0231192882825 petal length (cm) = 0.441030464364 petal width … Read more
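Since the question asks for a chart, here is one way to plot those importances with matplotlib (the plotting details are an assumption, not part of the original answer):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])

# Horizontal bar chart of impurity-based importances, one bar per feature.
plt.barh(iris["feature_names"], rnd_clf.feature_importances_)
plt.xlabel("feature importance (mean decrease in impurity)")
plt.tight_layout()
plt.show()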

Feature/Variable importance after a PCA analysis

First of all, I assume that by “features” you mean the variables, not the samples/observations. In that case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data. Before the example, please note that the basic idea when … Read more
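A compact sketch of such a biplot in the spirit of the answer: sample scores on the first two principal components plus a loading arrow for each original variable (the arrow scaling and styling below are illustrative assumptions):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris["data"])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)      # sample coordinates in PC space
loadings = pca.components_.T       # one row per original variable

plt.scatter(scores[:, 0], scores[:, 1], c=iris["target"], s=10)
for i, name in enumerate(iris["feature_names"]):
    # Arrows show how strongly each variable loads on PC1 and PC2.
    plt.arrow(0, 0, loadings[i, 0] * 3, loadings[i, 1] * 3, color="r", head_width=0.1)
    plt.text(loadings[i, 0] * 3.2, loadings[i, 1] * 3.2, name, color="r")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()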

Linear regression analysis with string/categorical features (variables)?

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent. Usually there are three possibilities: one-hot encoding for categorical data, arbitrary numbers for ordinal data, or something like group means for categorical data (e.g. mean prices for city districts). You have to be careful not to infuse … Read more
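A small sketch of all three options on a hypothetical DataFrame (the column names "district", "quality", and "price" are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "district": ["north", "south", "north", "east"],
    "quality":  ["low", "medium", "high", "medium"],
    "price":    [100, 150, 230, 180],
})

# 1) One-hot encoding for a nominal categorical feature
onehot = pd.get_dummies(df["district"], prefix="district")

# 2) Explicit integer codes for an ordinal feature
df["quality_ord"] = df["quality"].map({"low": 0, "medium": 1, "high": 2})

# 3) Group means (target encoding) -- compute these on training data only,
#    otherwise you infuse information from the target into the features.
df["district_mean_price"] = df.groupby("district")["price"].transform("mean")

print(pd.concat([df, onehot], axis=1))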

The easiest way for getting feature names after running SelectKBest in Scikit Learn

This doesn’t require loops.

from sklearn.feature_selection import SelectKBest, f_classif

# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)

# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:, cols_idxs]
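On recent scikit-learn versions (roughly 1.0 and later), the fitted selector above can also report the kept column names directly, which skips the indexing step:

# Follow-up to the snippet above; requires scikit-learn >= 1.0
# and fitting on a DataFrame so the column names are recorded.
kept_names = selector.get_feature_names_out()
print(kept_names)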

How are feature_importances in RandomForestClassifier determined?

There are indeed several ways to get feature “importances”. As is often the case, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in … Read more
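A small sketch illustrating the “averaged over the trees” part of that definition (the dataset is just for illustration): the forest-level importances agree with the normalized average of the per-tree impurity-based importances.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(iris["data"], iris["target"])

# Average the impurity-based importances of the individual trees...
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])
averaged = per_tree.mean(axis=0)

# ...and, after normalizing, they match the forest's feature_importances_.
print(np.allclose(averaged / averaged.sum(), clf.feature_importances_))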