Locality Sensitive Hashing is a method for placing similar items together. This can be seen as a clustering effort, where similar items, when hashed, will produce a value that is close to each other, and therefore fall into the same bucket.

When we talk about hashing, most of the time we want to form a unique signature avoid any hash collisions (i.e. the result of the hash of two items are not equal). For LSH however, we want to encourage hash collisions for similar items.

There are 2 parts to applying LSH to a problem:

Defining a distance metric between the items

Choosing an appropriate hash function to approximate the distance

Depending on the type of distance metric defined, we use different hash functions to approximate it. Below are some examples of the distance metrics used, and their corresponding hash functions

Distance Metric

Hash Function

Hamming Distance

SimHash

Jaccard Index

MinHash

This means that for each different hash function, how do we define elements to be similiar? For example, if we use MinHash, similar items hashed together would have similar Jaccard Index Values. Likewise for SimHash, similar hashed items would have small Hamming Distance with each other.

The problem for each of the approaches are usually dealing with what features to hash, and the hyperparameters. In the case of MinHash, it would be choosing the appropriate length of a shingle. In doing so, the hashes generated would be close for similar documents.