String similarity score/hash

I believe what you’re looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms, cryptographic ones in particular, are designed so that small variations in the input cause large changes in the output, these hashes attempt the opposite: small changes in the input should produce correspondingly small changes in the output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional space into far fewer dimensions. It’s analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you’re using to determine whether strings are “alike”.
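
To make the idea concrete, here is a minimal sketch of one well-known LSH scheme, SimHash, using character trigrams as the "alike" feature. This is only an illustration, not necessarily the scheme best suited to your data: the function names (`simhash`, `hamming_distance`), the trigram choice, and the use of MD5 as the per-feature hash are all assumptions made for the example. Similar strings should tend to get fingerprints that differ in only a few bits, while unrelated strings should differ in roughly half of them.

```python
import hashlib


def _feature_hash(feature: str) -> int:
    """Hash one feature to a 64-bit integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str, ngram: int = 3, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (bits must be <= 64 in this sketch)."""
    # Features are overlapping character n-grams; short inputs yield one feature.
    features = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]

    # Each feature votes +1 or -1 on every bit position of the fingerprint.
    weights = [0] * bits
    for feature in features:
        h = _feature_hash(feature)
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1

    # A bit is set in the fingerprint when its votes sum to a positive number.
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    base = simhash("the quick brown fox jumps over the lazy dog")
    near = simhash("the quick brown fox jumped over the lazy dog")
    far = simhash("a completely unrelated sentence about hashing")
    print("near:", hamming_distance(base, near))
    print("far: ", hamming_distance(base, far))
```

The Hamming distance between two fingerprints then serves as the similarity score. Changing the feature (words instead of trigrams, say) changes what the hash treats as "alike", which is exactly the trade-off described above.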
