Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing two scalar parameters A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn’s decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the “intercept” or “bias” or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = 1 / (1 + exp(0.1)) ≈ .475, which is below .5, so thresholding the probability gives the negative class. I’m pulling these numbers out of thin air, but you’ve noticed that this can occur in practice.
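To make the arithmetic concrete, here is a minimal sketch in plain Python that plugs the made-up numbers from the example into the sigmoid. The helper name platt_probability is mine, not part of scikit-learn or LibSVM, and the values of A and B are the illustrative ones above, not fitted to any data.

```python
import math


def platt_probability(f, A, B):
    """Platt's sigmoid: P(y|X) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))


# Illustrative numbers from the text (not fitted to any data).
f, A, B = 10.0, 1.0, -9.9

# Rule 1: predict from the sign of the decision function.
label_from_decision = f > 0

# Rule 2: predict by thresholding the calibrated probability at 0.5.
p = platt_probability(f, A, B)
label_from_probability = p > 0.5

print(label_from_decision)     # True
print(round(p, 3))             # 0.475
print(label_from_probability)  # False

# The probability crosses 0.5 where A * f + B = 0, i.e. at f = -B / A,
# which is not f = 0 unless B = 0.
print(-B / A)  # 9.9
```

The last line is the point of the example: the decision function switches class at f = 0, while the probability model switches at f = -B / A, so any sample whose score falls between those two values gets contradictory answers.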
Effectively, Platt scaling trains a probability model on top of the SVM’s outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross-validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
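If it helps to see what "a probability model on top of the SVM's outputs under a cross-entropy loss" amounts to, here is a rough sketch in plain Python. It is a simplification under my own assumptions, not what LibSVM actually runs: it uses plain gradient descent and raw 0/1 targets, the scores are hard-coded toy values standing in for held-out decision values from the cross-validation folds, and the names fit_platt, lr and n_iter are hypothetical.

```python
import math


def platt_probability(f, A, B):
    """Platt's sigmoid: P(y|X) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit A and B by minimizing mean cross-entropy with gradient descent.

    Simplified stand-in for the real optimizer; labels are 0 or 1.
    """
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_A = 0.0
        grad_B = 0.0
        for f, t in zip(scores, labels):
            p = platt_probability(f, A, B)
            # For this parameterization the per-sample gradients are
            # d(loss)/dA = (t - p) * f and d(loss)/dB = (t - p).
            grad_A += (t - p) * f
            grad_B += (t - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


# Toy decision values and labels (made up, deliberately not separable).
scores = [-2.0, -1.5, -1.0, -0.5, 0.3, 0.4, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

A, B = fit_platt(scores, labels)
print(A, B)
print(platt_probability(2.0, A, B), platt_probability(-2.0, A, B))
```

Fitting two numbers like this is cheap. The extra cost of probability=True comes from retraining the SVM on each of the five folds to get the held-out scores that the sigmoid is fitted to.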