One hot encoding of string categorical features

If you are on sklearn>0.20.dev0:

In [11]: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
    ...:
Out[11]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])

If you are on …
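As a self-contained variant of the session above, the following sketch reuses the same toy array and also inspects the learned categories. The `handle_unknown="ignore"` setting and the unseen value `"z"` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same toy data as the session above: one string column, one integer column.
X = np.array([["a", "b", "a", "c"], [0, 1, 0, 1]], dtype=object).T

# Assumption for illustration: encode unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X).toarray()

# categories_ holds the learned categories, one array per input column.
categories = [list(c) for c in encoder.categories_]
print(categories)

# "z" was never seen during fit, so its three string columns stay at zero.
unseen = encoder.transform(np.array([["z", 1]], dtype=object)).toarray()
print(unseen)
```

The output columns follow the order of `categories_`, so the first three columns belong to the string feature and the last two to the integer feature.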

Adding dummy columns to the original dataframe

In [77]: df = pd.concat([df, pd.get_dummies(df['YEAR'])], axis=1); df
Out[77]:
      JOINED_CO GENDER   EXEC_FULLNAME  GVKEY  YEAR    CONAME  BECAMECEO  \
5622        NaN   MALE  Ira A. Eichner   1004  1992  AAR CORP   19550101
5622        NaN   MALE  Ira A. Eichner   1004  1993  AAR CORP   19550101
5622        NaN   MALE  Ira A. Eichner   1004  1994  AAR CORP   19550101
5622        NaN   MALE  Ira A. …
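The same pattern can be tried on a small frame. This is a minimal sketch: the three-row frame is a hypothetical miniature of the data above, and the `prefix="YEAR"` argument and the `astype(int)` cast are additions for readability, not part of the original answer.

```python
import pandas as pd

# Hypothetical miniature of the frame above; only two columns are kept.
df = pd.DataFrame({"CONAME": ["AAR CORP"] * 3, "YEAR": [1992, 1993, 1994]})

# prefix="YEAR" yields string column names such as "YEAR_1992"
# rather than bare integers; astype(int) gives 0/1 on any pandas version.
dummies = pd.get_dummies(df["YEAR"], prefix="YEAR").astype(int)

# axis=1 appends the dummy columns side by side, aligned on the index.
df = pd.concat([df, dummies], axis=1)
print(df)
```

The original `YEAR` column is kept alongside the new indicator columns, which matches the behavior of the one-liner above.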

Feature names from OneHotEncoder

A list with the original column names can be passed to get_feature_names.

>>> encoder.get_feature_names(['Sex', 'AgeGroup'])
array(['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
       'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75'],
      dtype=object)

DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead. As per sklearn.preprocessing.OneHotEncoder.

>>> encoder.get_feature_names_out(['Sex', 'AgeGroup'])
array(['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
       'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75'],
      dtype=object)