Given a 1 TB data set on disk with around 1 KB per data record, how can I find duplicates using 512 MB RAM and infinite disk space?

The solutions offered so far seem too complicated. A Bloom filter, while being the data structure du jour for the last several years, isn't the best fit here: because no data can be associated with the hashed content, you would not only have to maintain the Bloom filter, but still write each hash value and record offset to disk to resolve the matches, which destroys the benefit of the filter. Worse, 512 MB of RAM buys only about four bits of filter per record, so the false-positive rate would be preposterously high.
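
To put a rough number on that, here's a back-of-the-envelope check, a sketch assuming the entire 512 MB were devoted to the filter and roughly one billion records, using the standard Bloom filter false-positive approximation:

```python
import math

n = 10**9                      # roughly one billion 1 KB records
m = 512 * 2**20 * 8            # bits available if all 512 MB went to the filter

bits_per_record = m / n                                   # ~4.3 bits per record
k = max(1, round(bits_per_record * math.log(2)))          # optimal hash count, ~3
false_positive_rate = (1 - math.exp(-k * n / m)) ** k     # classic approximation

print(f"{bits_per_record:.1f} bits per record, k = {k}, "
      f"false-positive rate ~ {false_positive_rate:.0%}")  # roughly 13%
```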

On the other hand, merge sorting the entire terabyte takes not only O(n log n) comparisons but also a great deal of disk traffic, since most of the intermediate runs would have to be merged from disk rather than from memory. Any real solution should reduce disk traffic as much as possible, since that's our primary bottleneck.

My solution is simple, making one assumption: that the terabyte of data is recorded in what’s effectively one file.

Iterate through the records of the terabyte file and hash them. A cryptographic hash is unnecessary here: it's costly, and its output is larger than we need. Instead, use something like the 64-bit version of MurmurHash. It can hash more than 2 GiB/sec (far faster than we'll likely need, given the speed of storage these days) and has excellent (though not cryptographically secure) collision resistance. With a 64-bit hash, the birthday bound puts the expected first collision at around 2^32 records, so it's probable that our roughly one billion records will have no collisions at all.
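
A quick sanity check on that claim, a sketch assuming roughly one billion records and an ideally distributed 64-bit hash:

```python
import math

n = 10**9          # roughly one billion records
space = 2**64      # possible 64-bit hash values

expected_pairs = n * (n - 1) / 2 / space                 # expected colliding pairs
p_any = 1 - math.exp(-n * (n - 1) / (2 * space))         # P(at least one collision)

print(f"expected colliding pairs: {expected_pairs:.3f}")  # ~0.027
print(f"P(any collision):         {p_any:.3f}")           # ~0.027
```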

Write the hashes and their associated record offsets out to another file. Since the records contain arbitrary binary data, we can't rely on Unix's sort(1) to sort it, because some of the hashes and offsets may contain bytes that sort(1) would interpret as newlines. Instead, we write fixed-width 16-byte entries: 8 bytes for the 64-bit MurmurHash2 value and 8 bytes for the record's offset in the terabyte file. The resulting file should be about 16 GB, given our number of records.
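
A minimal sketch of this pass. MurmurHash isn't in the Python standard library (the third-party mmh3 package provides a binding), so this stand-in truncates blake2b to 8 bytes; the fixed 1 KB record size and the build_index helper name are assumptions for illustration:

```python
import hashlib
import struct

RECORD_SIZE = 1024   # assumed fixed 1 KB records
ENTRY_SIZE = 16      # 8-byte hash + 8-byte offset

def record_hash(data):
    # Stand-in for a 64-bit MurmurHash: blake2b truncated to 8 bytes.
    return hashlib.blake2b(data, digest_size=8).digest()

def build_index(data_path, index_path):
    """One pass over the big file, writing a fixed-width (hash, offset) entry per record."""
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        offset = 0
        while True:
            record = data.read(RECORD_SIZE)
            if not record:
                break
            # Hash first and big-endian offset second, so sorting entries as raw
            # bytes later orders them by hash value. Fixed width means arbitrary
            # binary data can't be mistaken for a delimiter.
            index.write(record_hash(record) + struct.pack(">Q", offset))
            offset += RECORD_SIZE
```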

We can sort this file by reading as many entries as will safely fit into memory, sorting them, and flushing each sorted chunk back to disk. We could fit more entries into memory with a heapsort (it uses O(1) auxiliary space) than with a quicksort (which uses O(log n) memory for its call stack), but in most implementations quicksort wins by virtue of its memory locality and lower instruction count. These intermediate runs (there should be 35-40 of them) are written out to disk.
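
A sketch of the run-generation step. The chunk size here is illustrative; a real implementation would use as much of the 512 MB as it safely can, which is where the 35-40 runs figure comes from:

```python
import os

ENTRY_SIZE = 16
RUN_ENTRIES = 4 * 2**20   # entries per in-memory chunk; illustrative, tune to the RAM budget

def sort_runs(index_path, run_dir):
    """Cut the index file into chunks that fit in memory, sort each, and write sorted runs."""
    runs = []
    with open(index_path, "rb") as index:
        while True:
            chunk = index.read(RUN_ENTRIES * ENTRY_SIZE)
            if not chunk:
                break
            # Entries are hash-first and big-endian, so a plain lexicographic sort
            # of the raw 16-byte slices orders them by hash.
            entries = sorted(chunk[i:i + ENTRY_SIZE] for i in range(0, len(chunk), ENTRY_SIZE))
            run_path = os.path.join(run_dir, f"run-{len(runs):04d}")
            with open(run_path, "wb") as run:
                run.write(b"".join(entries))
            runs.append(run_path)
    return runs
```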

The last step is to merge these runs (in memory; there's no need to store the merged result on disk), collecting all hash collisions, looking up the associated records in the terabyte file, comparing them for true duplication, and emitting the duplicate records (or their offsets) in whatever form the problem specifies.
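
A sketch of the merge-and-verify step, assuming the fixed-width entry layout above. For simplicity it compares adjacent entries only; a fuller version would group every entry sharing a hash before verifying:

```python
import heapq
import struct

ENTRY_SIZE = 16
RECORD_SIZE = 1024   # assumed fixed 1 KB records

def iter_entries(run_path):
    """Stream fixed-width entries from one sorted run."""
    with open(run_path, "rb") as run:
        while True:
            entry = run.read(ENTRY_SIZE)
            if not entry:
                break
            yield entry

def find_duplicates(run_paths, data_path):
    """Merge the sorted runs; when two adjacent entries share a hash, fetch both
    records from the terabyte file and compare them byte for byte."""
    with open(data_path, "rb") as data:
        prev_hash, prev_offset = None, None
        for entry in heapq.merge(*(iter_entries(p) for p in run_paths)):
            h, offset = entry[:8], struct.unpack(">Q", entry[8:])[0]
            if h == prev_hash:
                data.seek(prev_offset)
                first = data.read(RECORD_SIZE)
                data.seek(offset)
                second = data.read(RECORD_SIZE)
                if first == second:
                    yield prev_offset, offset
            prev_hash, prev_offset = h, offset
```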

As far as I can tell, this approach hits the disk significantly less than any other solution offered, and it's conceptually very simple: hash the records, look for duplicates among the hashes, and verify against the actual records.

For disk I/O, the process reads the terabyte data file once, writes 16 GB of hashes and offsets to disk, reads those 16 GB back and writes them out again sorted, then reads the sorted runs to report the duplicates. As an optimization, the hashing pass could accumulate entries in memory and sort each batch before flushing it to disk: that eliminates the unsorted 16 GB intermediate file and lets the process move directly from hashing to merging and reporting duplicates.
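
A sketch of that combined pass, folding the hashing and run-generation steps together under the same assumptions as before (fixed 1 KB records, blake2b as a stand-in hash, an illustrative batch size):

```python
import hashlib
import os
import struct

RECORD_SIZE = 1024
BATCH_ENTRIES = 4 * 2**20   # illustrative batch size; tune to the real memory budget

def hash_and_sort_runs(data_path, run_dir):
    """Hash records and emit already-sorted runs in one pass, skipping the
    unsorted 16 GB intermediate file entirely."""
    runs, batch, offset = [], [], 0
    with open(data_path, "rb") as data:
        while True:
            record = data.read(RECORD_SIZE)
            if record:
                digest = hashlib.blake2b(record, digest_size=8).digest()
                batch.append(digest + struct.pack(">Q", offset))
                offset += RECORD_SIZE
            if batch and (len(batch) == BATCH_ENTRIES or not record):
                # Flush a full batch (or the final partial one) as a sorted run.
                run_path = os.path.join(run_dir, f"run-{len(runs):04d}")
                with open(run_path, "wb") as run:
                    run.write(b"".join(sorted(batch)))
                runs.append(run_path)
                batch = []
            if not record:
                break
    return runs
```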
