MinHash
In computer science, matching the performance of the multiple-hash-function scheme.
Contents
Time analysis
The estimator can be computed in time from the two signatures of the given sets, in either variant of the scheme. Therefore, when and are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in on the size of the set, so when many pairwise similarities need to be estimated this method can lead to a substantial savings in running time compared to doing a full comparison of the members of each set. Specifically, for set size the many hash variant takes time. The single hash variant is generally faster, requiring time to maintain the queue of minimum hash values assuming . Approximate min-wise independence has at most a fixed probability of varying from full independence.
Applications
The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents. Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.
In , use MinHash as a tool for . Given a database in which each entry has multiple attributes (viewed as a with a row per database entry and a column per attribute) they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those pairs to determine the ones whose frequencies of co-occurrence are below a given strict threshold.
The MinHash algorithm has been adapted for bioinformatics, where the problem of comparing genome content has a similar theoretical underpinning to that of comparing documents on the web. There are various software implementations for this, including mash and sourmash . These tools allow the very rapid comparison of whole genome sequencing data with reference genomes (around 3 minutes to compare one genome with the 90000 reference genomes in ), and are suitable for speciation and maybe a limited degree of microbial sub-typing. There are also applications for metagenomics .
Other uses
The MinHash scheme may be seen as an instance of , a collection of techniques for using hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques exist for between sets and between ; locality sensitive hashing has important applications in algorithms. For large distributed systems, and in particular , there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.
Evaluation and benchmarks
A large scale evaluation has been conducted by in 2006 to compare the performance of Minhash and SimHash algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling and using Minhash and for personalization.