Locality Sensitive Hashing

Locality Sensitive Hashing is a method for placing similar items together. This can be seen as a clustering effort, where similar items, when hashed, will produce a value that is close to each other, and therefore fall into the same bucket.

When we talk about hashing, most of the time we want to form a unique signature avoid any hash collisions (i.e. the result of the hash of two items are not equal). For LSH however, we want to encourage hash collisions for similar items.

Image for post

There are 2 parts to applying LSH to a problem:

  1. Defining a distance metric between the items
  2. Choosing an appropriate hash function to approximate the distance

Depending on the type of distance metric defined, we use different hash functions to approximate it. Below are some examples of the distance metrics used, and their corresponding hash functions

Distance MetricHash Function
Hamming DistanceSimHash
Jaccard IndexMinHash

This means that for each different hash function, how do we define elements to be similiar? For example, if we use MinHash, similar items hashed together would have similar Jaccard Index Values. Likewise for SimHash, similar hashed items would have small Hamming Distance with each other.

The problem for each of the approaches are usually dealing with what features to hash, and the hyperparameters. In the case of MinHash, it would be choosing the appropriate length of a shingle. In doing so, the hashes generated would be close for similar documents.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s