data-mining – Row Coding

Decision tree vs. Naive Bayes classifier [closed]

November 15, 2023 by Tarik

Decision Trees are very flexible, easy to understand, and easy to debug. They work with both classification problems and regression problems. So whether you are trying to predict a categorical value (such as red, green, up, down) or a continuous value (such as 2.9 or 3.4), Decision Trees will handle both … Read more
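As a minimal sketch of the "handles both" point, the snippet below fits one tree on categorical labels and another on continuous targets. It assumes scikit-learn, which the excerpt does not name; the toy data and variable names are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy feature matrix (hypothetical data): one numeric feature per sample.
X = [[0.0], [1.0], [2.0], [3.0]]

# Classification: the target is a categorical label.
y_labels = ["red", "red", "green", "green"]
clf = DecisionTreeClassifier(random_state=0).fit(X, y_labels)
label_pred = clf.predict([[0.2], [2.8]])

# Regression: the target is a continuous value.
y_values = [2.9, 2.9, 3.4, 3.4]
reg = DecisionTreeRegressor(random_state=0).fit(X, y_values)
value_pred = reg.predict([[0.2], [2.8]])

print(label_pred, value_pred)
```

The same features feed both models; only the estimator class and the type of the target change.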

Can anyone give a real life example of supervised learning and unsupervised learning? [closed]

September 18, 2023 by Tarik

Supervised learning: You get a bunch of photos with information about what is on them and then you train a model to recognize new photos. You have a bunch of molecules and information about which are drugs and you train a model to answer whether a new molecule is also a drug. Unsupervised learning: You … Read more

Kmeans without knowing the number of clusters? [duplicate]

September 9, 2023 by Tarik

One approach is cross-validation. In essence, you pick a subset of your data and cluster it into k clusters, and you ask how well it clusters, compared with the rest of the data: Are you assigning data points to the same cluster memberships, or are they falling into different clusters? If the memberships are roughly … Read more
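One way to read the idea above is as a stability check: cluster two disjoint halves of the data separately, then see whether the two models assign the full data set to the same groups. The sketch below is only an illustration under that reading, not the answer's own code. It assumes scikit-learn's `KMeans` and uses the adjusted Rand index as the agreement measure; the synthetic blobs and the `stability` helper are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: three well-separated blobs in 2D.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Split the data into two disjoint halves.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
half_a, half_b = X[idx[:150]], X[idx[150:]]

def stability(k):
    """Agreement between k-means models fitted on the two halves."""
    model_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(half_a)
    model_b = KMeans(n_clusters=k, n_init=10, random_state=1).fit(half_b)
    # Label all points with both models and compare the two partitions.
    # The adjusted Rand index ignores how the cluster ids are numbered.
    return adjusted_rand_score(model_a.predict(X), model_b.predict(X))

scores = {k: stability(k) for k in range(2, 7)}
for k, score in scores.items():
    print(k, round(score, 3))
```

A value of k whose score stays close to 1 gives consistent memberships across subsets. Small values of k can also look stable, so the scores are probably best read alongside another criterion rather than on their own.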

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

September 3, 2023 by Tarik

Write code yourself. Then it fits your problem best! Boilerplate: Never assume code you download from the net to be correct or optimal… make sure to fully understand it before using it. %matplotlib inline from numpy import array, linspace from sklearn.neighbors import KernelDensity from matplotlib.pyplot import plot a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1) kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a) … Read more
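The excerpt stops right after fitting the density, so what follows is an assumption rather than the author's continuation. One common way to turn a 1D density estimate into clusters is to cut the axis at the local minima of the density. The sketch reuses the array and bandwidth from the excerpt; the evaluation grid, the use of `scipy.signal.argrelextrema`, and the `np.digitize` labelling are illustrative choices.

```python
import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

# Same data and bandwidth as in the excerpt.
a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a)

# Evaluate the log-density on a fine grid (grid range is an assumption).
grid = np.linspace(0, 50, 501)
log_density = kde.score_samples(grid.reshape(-1, 1))

# Local minima of the density act as boundaries between clusters.
minima = grid[argrelextrema(log_density, np.less)[0]]

# Assign each point to the interval it falls into.
labels = np.digitize(a.ravel(), minima)

print("cut points:", minima)
print("labels:", labels)
```

The bandwidth controls how many minima appear, so it effectively plays the role that the number of clusters plays in k-means.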

R Random Forests Variable Importance

August 14, 2023 by Tarik

How to calculate the regularization parameter in linear regression

August 6, 2023 by Tarik

The regularization parameter (lambda) is an input to your model, so what you probably want to know is how to select the value of lambda. The regularization parameter reduces overfitting, which reduces the variance of your estimated regression parameters; however, it does this at the expense of adding bias to your estimate. Increasing lambda … Read more
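A typical way to select lambda is cross-validation over a grid of candidate values. The sketch below assumes ridge (L2) regression and scikit-learn's `RidgeCV`, where the parameter is called `alpha`; the synthetic data and the candidate grid are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Hypothetical regression data with a known coefficient vector plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coef = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.5, size=100)

# Candidate lambdas on a log scale (scikit-learn calls the parameter alpha).
candidate_lambdas = np.logspace(-3, 3, 13)

# 5-fold cross-validation picks the candidate with the best validation score.
model = RidgeCV(alphas=candidate_lambdas, cv=5).fit(X, y)
best_lambda = model.alpha_

print("selected lambda:", best_lambda)
print("coefficients:", model.coef_)
```

If the selected value sits at either end of the grid, widening the grid in that direction is usually worth a try.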

scikit-learn: Predicting new points with DBSCAN

August 2, 2023 by Tarik

While Anony-Mousse has some good points (clustering is indeed not classifying), I think the ability to assign new points has its usefulness. * Based on the original paper on DBSCAN and robertlayton's ideas on github.com/scikit-learn, I suggest running through core points and assigning to the cluster of the first core point that is within eps … Read more
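The suggestion above can be sketched as follows. This is an illustration under stated assumptions, not the answer's own code: the helper name `dbscan_predict` is hypothetical, the distance is assumed to be Euclidean, and points with no core point within `eps` are labelled as noise (-1). It relies on the fitted model's `components_`, `core_sample_indices_`, `labels_`, and `eps` attributes.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_predict(model, X_new):
    """Assign each new point to the cluster of the first core point within eps."""
    X_new = np.asarray(X_new, dtype=float)
    labels = np.full(len(X_new), -1, dtype=int)  # default: noise
    core_points = model.components_
    core_labels = model.labels_[model.core_sample_indices_]
    for i, x in enumerate(X_new):
        for core, label in zip(core_points, core_labels):
            # Euclidean distance is an assumption; match the metric used for fitting.
            if np.linalg.norm(x - core) <= model.eps:
                labels[i] = label
                break
    return labels

# Hypothetical training data: two tight groups.
X = np.array([[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [5.1, 5]])
model = DBSCAN(eps=0.5, min_samples=2).fit(X)

predicted = dbscan_predict(model, [[0.05, 0.05], [5.05, 5.05], [10, 10]])
print(predicted)
```

Taking the first matching core point keeps the loop simple, but a new point that lies within `eps` of core points from two different clusters gets whichever one comes first.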

How many principal components to take?

July 22, 2023 by Tarik

To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it for reducing storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don’t have any strict constraints, I recommend plotting the cumulative sum of eigenvalues (assuming … Read more
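A minimal sketch of the cumulative-eigenvalue check, assuming scikit-learn's `PCA`, whose `explained_variance_ratio_` holds the normalized eigenvalues. The synthetic data and the 95% threshold are assumptions for illustration; the excerpt does not state a cutoff.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 features driven mostly by 3 latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Fit PCA with all components and accumulate the explained-variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95  # assumed cutoff, adjust to your own constraints
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative variance:", np.round(cumulative, 3))
print("components to keep:", n_components)
```

Plotting `cumulative` against the component index gives the curve the answer describes, and the point where it flattens is a natural place to stop.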

Calculate AUC in R?

July 13, 2023 by Tarik

PCA For categorical features?

June 8, 2023 by Tarik

I disagree with the others. While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good thing, or that it will work very well. PCA is designed for continuous variables. It tries to maximize the variance captured (equivalently, to minimize squared deviations). The concept of squared deviations breaks down when you have … Read more