Pure-Java, dependency-free implementations of two perceptual image hashing algorithms, used for finding visually similar images at scale in the netarchive:

- PdqHasher: Meta's PDQ hash, a 256-bit perceptual hash.
- PhashHasher: the classic DCT-based pHash, 64-bit, compatible with the phim Python/Rust reference implementation.

Both implementations were validated byte-for-byte against their respective
reference implementations (Meta's C++ pdqhash and the National Library of
Norway's phim) — see PdqHasherTest and PhashHasherTest.
Add the following dependency to your pom.xml:
<dependency>
    <groupId>io.github.netarchivesuite</groupId>
    <artifactId>image-hashes</artifactId>
    <version>1.1.1</version>
</dependency>

Both algorithms are designed to be robust to re-encoding (resizing, JPEG recompression, mild noise) but are not designed to be robust to geometric transforms (rotation, mirroring) or to aggressive cropping. The table below demonstrates this directly using a single source photo run through twelve transforms.
A Hamming distance close to 0 means "the algorithm considers these the same image." A large Hamming distance means "the algorithm does not consider these similar" — for PDQ, the conventional similarity threshold is ≤ 31 (out of 256 bits); for pHash, a commonly used threshold is ≤ 10 (out of 64 bits). Distances above those thresholds are shown in italics below to flag the cases where similarity detection is expected to fail.
- Re-encoding transforms (downscale, grayscale, noise, JPEG recompression, translucent watermark) all stay comfortably under both similarity thresholds — these are the cases the hashes are designed for.
- Geometric transforms (any rotation, mirroring) and aggressive cropping push both hashes well past their thresholds. This is expected: neither algorithm is rotation-, mirror-, or crop-invariant by design. Cropping is notable because, unlike rotation, it's a very common real-world transform in netarchive material (thumbnails, re-published excerpts) — so a 15% crop here behaves like a full 90° rotation in terms of detectability.
- An opaque caption bar costs more distance than a translucent watermark covering a similar area, since the opaque version destroys the underlying pixel information entirely rather than just attenuating it.
See PdqHasherCatImageSetTest.java and PhashHasherCatImageSetTest.java
for the full test suite that generated these numbers.
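
The threshold rule described above can be illustrated with a small standalone sketch. It assumes hashes are exchanged as hexadecimal strings of equal length, which is an assumption made for this example; the class and method names are hypothetical and this is not the library's own hammingDistance code.

```java
import java.math.BigInteger;

/** Sketch: Hamming distance between two equal-length hexadecimal hash strings. */
final class HammingSketch {

    // Thresholds quoted in the text above.
    static final int PDQ_THRESHOLD = 31;   // out of 256 bits
    static final int PHASH_THRESHOLD = 10; // out of 64 bits

    /** Counts the bits that differ. Assumes both inputs are hex strings of equal length. */
    static int hammingDistance(String hexA, String hexB) {
        if (hexA.length() != hexB.length()) {
            throw new IllegalArgumentException("hash lengths differ");
        }
        BigInteger a = new BigInteger(hexA, 16);
        BigInteger b = new BigInteger(hexB, 16);
        // XOR leaves a 1 bit wherever the two hashes disagree.
        return a.xor(b).bitCount();
    }

    /** True when the distance is at or below the given threshold. */
    static boolean isSimilar(String hexA, String hexB, int threshold) {
        return hammingDistance(hexA, hexB) <= threshold;
    }

    public static void main(String[] args) {
        String h1 = "f0f0f0f0f0f0f0f0";
        String h2 = "f0f0f0f0f0f0f0ff"; // differs from h1 in the lowest 4 bits
        System.out.println(hammingDistance(h1, h2));
        System.out.println(isSimilar(h1, h2, PHASH_THRESHOLD));
    }
}
```

The same function works for both hash sizes; only the threshold changes.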
Both hash algorithms were benchmarked against independent reference implementations on the same test image (a 1070×700 PNG) and on the same machine (13th Gen Intel(R) Core(TM) i9-13900K). The Java implementations were timed on a warmed-up JVM (JIT-compiled before timing); the Python implementations were timed with time.perf_counter(). Performance is identical for JPEG images of the same dimensions. The relative ranking between implementations also stays consistent across different image sizes: larger images are slower for all implementations, but the ratios between them remain the same, so the table below is representative regardless of the image resolution in your corpus.
| Implementation | Algorithm | ms/image | Throughput |
|---|---|---|---|
| image-hashes (Java) | pdqHash | 6.1 ms | 163.6 img/sec |
| image-hashes (Java) | pdqHash (8 dihedral variants) | 6.1 ms | 163.9 img/sec |
| image-hashes (Java) | pHash | 10.8 ms | 93.0 img/sec |
| phim (Python/C++) | pdqHash (8 dihedral variants) | 9.1 ms | 110.2 img/sec |
| phim (Python/C++) | pHash | 4.1 ms | 242.1 img/sec |
| Meta official (Java) | pdqHash (naive) | 16.3 ms | 61.3 img/sec |
| Meta official (Java) | pdqHash (pre-allocated) | 15.6 ms | 63.9 img/sec |
- PDQ: this library is 2.6x faster than Meta's own official Java reference implementation. The gap is explained by a single optimization: direct DataBufferByte pixel extraction, bypassing the per-pixel getRGB(x, y) call that Meta's implementation still uses. This library is also 1.5x faster than phim's C++-backed Python implementation.
- PDQ dihedral variants: computing all 8 rotation/mirror variants via getAllDihedralHashes() costs essentially the same as computing a single hash, because both pay the same dominant cost (pixel extraction, Jarosz filter, 2D DCT) exactly once. The 7 extra variants are derived cheaply from the same 16×16 2D DCT buffer without re-running the pipeline.
- pHash: phim is 2.6x faster than this library. phim's Python pHash is backed by C++, likely with an FFT-based 2D DCT.
- Meta's buffer pre-allocation: Meta's fromBufferedImage() API is designed for callers to pre-allocate and reuse scratch buffers across calls. In practice this makes almost no difference (16.3 ms naive vs 15.6 ms pre-allocated) because the dominant cost is the per-pixel getRGB() loop, not buffer allocation.
- No official Meta/Facebook Java implementation of pHash exists. Unlike PDQ, pHash has no single canonical reference, so the table has no Meta entry for pHash.
- Per-image timings are averages over 1000 hash computations, with all implementations confirmed to produce byte-for-byte identical hashes before timing.
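
The pixel-extraction point and the timing method above can be sketched together. The example below compares a per-pixel getRGB(x, y) loop with direct DataBufferByte access and times both after a warm-up phase. It is an illustration under stated assumptions, not the library's or Meta's code: it handles only TYPE_3BYTE_BGR images that are not sub-images, uses Rec. 601 luma weights as a stand-in for the real pipeline, and all class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.util.Arrays;

final class PixelExtractionSketch {

    static int[] sink; // keeps results reachable so the JIT cannot discard the work

    /** Slow path: one getRGB(x, y) call per pixel. */
    static int[] lumaViaGetRgb(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        int[] luma = new int[w * h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                luma[y * w + x] = (299 * r + 587 * g + 114 * b) / 1000;
            }
        }
        return luma;
    }

    /** Fast path: read the backing byte array directly (bytes are stored as B, G, R). */
    static int[] lumaViaDataBuffer(BufferedImage img) {
        if (img.getType() != BufferedImage.TYPE_3BYTE_BGR) {
            throw new IllegalArgumentException("sketch handles TYPE_3BYTE_BGR only");
        }
        byte[] data = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        int n = img.getWidth() * img.getHeight();
        int[] luma = new int[n];
        for (int i = 0, p = 0; i < n; i++, p += 3) {
            int b = data[p] & 0xFF, g = data[p + 1] & 0xFF, r = data[p + 2] & 0xFF;
            luma[i] = (299 * r + 587 * g + 114 * b) / 1000;
        }
        return luma;
    }

    /** Runs the task warmup times untimed, then returns the mean time per run in ms. */
    static double averageMillis(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) task.run();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) task.run();
        return (System.nanoTime() - start) / 1e6 / runs;
    }

    /** Synthetic image with a deterministic, non-uniform pattern. */
    static BufferedImage testImage(int w, int h) {
        BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_3BYTE_BGR);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = (((x * 7) & 0xFF) << 16) | (((y * 13) & 0xFF) << 8) | ((x + y) & 0xFF);
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage img = testImage(1070, 700);
        System.out.println("identical output: "
                + Arrays.equals(lumaViaGetRgb(img), lumaViaDataBuffer(img)));
        double slow = averageMillis(() -> sink = lumaViaGetRgb(img), 50, 200);
        double fast = averageMillis(() -> sink = lumaViaDataBuffer(img), 50, 200);
        System.out.printf("getRGB: %.2f ms, DataBufferByte: %.2f ms%n", slow, fast);
    }
}
```

Absolute numbers from such a micro-benchmark depend on the JVM and hardware, so they are not comparable to the table above; the sketch only shows the shape of the comparison.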
Both hashers take a BufferedImage — load it however you normally would
(from a file, an HTTP response, a WARC record, etc.) and pass it directly.
BufferedImage image = ImageIO.read(new File("photo.jpg"));
String hash = PhashHasher.getHash(image);
int distance = PhashHasher.hammingDistance(hash1, hash2);

Hashes with a Hamming distance ≤ 10 (out of 64) are typically considered perceptually similar.
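
Because the hashers only need a BufferedImage, images that arrive as raw bytes (for example the payload of an HTTP response or a WARC record) can be decoded in memory first. The sketch below is one way to do that with the JDK's ImageIO; the class and method names are hypothetical and not part of this library. ImageIO.read returns null when no registered reader recognizes the data, so the result is wrapped in an Optional.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;
import javax.imageio.ImageIO;

final class ImageDecodeSketch {

    static {
        // Keep ImageIO from using a temp-file cache for in-memory payloads.
        ImageIO.setUseCache(false);
    }

    /** Decodes image bytes; empty if no ImageIO reader recognizes the format. */
    static Optional<BufferedImage> decode(byte[] payload) {
        try {
            return Optional.ofNullable(ImageIO.read(new ByteArrayInputStream(payload)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: encodes an image as PNG bytes. */
    static byte[] encodePng(BufferedImage img) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(img, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] png = encodePng(new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB));
        System.out.println("PNG decoded: " + decode(png).isPresent());
        System.out.println("garbage decoded: " + decode(new byte[] {1, 2, 3}).isPresent());
    }
}
```

A decoded image obtained this way can then be passed to either hasher as shown in the usage snippets.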
Hash only:
String hash = PdqHasher.getHash(image);
int distance = PdqHasher.hammingDistance(hash1, hash2);

Hash with quality score, useful for filtering out unreliable hashes from flat or low-detail images (quality ≤ 49 indicates a poor hash):
PdqHasher.Result result = PdqHasher.getHashAndQuality(image);
String hash = result.hash;
int quality = result.quality;
if (quality > 49) {
    // hash is reliable: store or compare it
}

All 8 dihedral variants (rotations and mirrors), computed in a single pipeline pass at essentially no extra cost over a single hash:
String[] hashes = PdqHasher.getAllDihedralHashes(image);
// Variant names are available via PdqHasher.DIHEDRAL_NAMES:
// original, rotate90, rotate180, rotate270, flipX, flipY, flipPlus1, flipMinus1
// Check if a query image matches any orientation of a stored image:
int distance = PdqHasher.minHammingDistance(hashes, queryHash);

Hashes with a Hamming distance ≤ 31 (out of 256) are typically considered perceptually similar.
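
Why the extra orientations are cheap can be seen from a general property of the DCT-II: mirroring an image horizontally only flips the sign of the odd horizontal frequencies, and transposing the image transposes the coefficient matrix. Rotations and the other flips are combinations of these operations. The sketch below checks both identities numerically with a naive, unnormalized 2D DCT on a small block. It is not the library's code; names are hypothetical, and a real implementation may index frequencies differently (for example by skipping the constant term), which shifts which coefficients change sign.

```java
final class DctSymmetrySketch {

    /** Naive unnormalized 2D DCT-II of an n x n block (rows = y, columns = x). */
    static double[][] dct2d(double[][] px) {
        int n = px.length;
        double[][] out = new double[n][n];
        for (int u = 0; u < n; u++) {
            for (int v = 0; v < n; v++) {
                double sum = 0;
                for (int y = 0; y < n; y++) {
                    for (int x = 0; x < n; x++) {
                        sum += px[y][x]
                                * Math.cos(Math.PI * (2 * y + 1) * u / (2.0 * n))
                                * Math.cos(Math.PI * (2 * x + 1) * v / (2.0 * n));
                    }
                }
                out[u][v] = sum;
            }
        }
        return out;
    }

    /** Mirrors the block left to right in pixel space. */
    static double[][] mirrorHorizontally(double[][] px) {
        int n = px.length;
        double[][] out = new double[n][n];
        for (int y = 0; y < n; y++)
            for (int x = 0; x < n; x++)
                out[y][x] = px[y][n - 1 - x];
        return out;
    }

    /** Same mirror applied in the DCT domain: negate odd horizontal frequencies. */
    static double[][] mirrorInDctDomain(double[][] dct) {
        int n = dct.length;
        double[][] out = new double[n][n];
        for (int u = 0; u < n; u++)
            for (int v = 0; v < n; v++)
                out[u][v] = (v % 2 == 0) ? dct[u][v] : -dct[u][v];
        return out;
    }

    static double[][] transpose(double[][] m) {
        int n = m.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                out[j][i] = m[i][j];
        return out;
    }

    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a.length; j++)
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
        return max;
    }

    /** Deterministic, asymmetric test block. */
    static double[][] sampleBlock(int n) {
        double[][] px = new double[n][n];
        for (int y = 0; y < n; y++)
            for (int x = 0; x < n; x++)
                px[y][x] = (3 * x * x + 7 * y + x * y) % 17;
        return px;
    }

    public static void main(String[] args) {
        double[][] px = sampleBlock(8);
        double[][] dct = dct2d(px);
        System.out.println("mirror error:    "
                + maxAbsDiff(dct2d(mirrorHorizontally(px)), mirrorInDctDomain(dct)));
        System.out.println("transpose error: "
                + maxAbsDiff(dct2d(transpose(px)), transpose(dct)));
    }
}
```

Both printed errors should be at the level of floating-point rounding, which is why the expensive pipeline only needs to run once per image.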
The pHash algorithm resizes the source image to 32×32 pixels using a Lanczos filter before computing the 2D DCT. Lanczos is not a single deterministic standard: different libraries make different implementation choices and produce different pixel values, resulting in incompatible hashes from the same source image. This library produces byte-for-byte identical hashes to phim, which uses Pillow's built-in Lanczos filter (implemented in C) for downscaling. The following libraries produce different, incompatible hashes: OpenCV, TwelveMonkeys, and scikit-image.
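
To make the "implementation choices" concrete, the sketch below computes Lanczos-3 weights for one output sample when downscaling a single row. The kernel formula itself is standard, but the surrounding decisions (whether the kernel is widened when downscaling, where the sample window starts and ends, how weights are normalized and rounded) are exactly where libraries diverge. The choices made here are one plausible set picked for illustration; they are not a reproduction of Pillow, phim, or this library, and all names are hypothetical.

```java
final class LanczosSketch {

    static final int A = 3; // Lanczos-3: kernel is non-zero on (-3, 3)

    /** Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1. */
    static double sinc(double x) {
        if (x == 0) return 1.0;
        double px = Math.PI * x;
        return Math.sin(px) / px;
    }

    /** Lanczos kernel: sinc(x) * sinc(x / A) inside the support, 0 outside. */
    static double lanczos(double x) {
        if (Math.abs(x) >= A) return 0.0;
        return sinc(x) * sinc(x / A);
    }

    /**
     * Normalized weights for output sample outIndex when resizing a row of
     * srcSize samples to dstSize samples. Assumed choices: pixel centers at
     * i + 0.5, kernel widened by the scale factor when downscaling, window
     * bounds taken with floor/ceil and clamped to the row.
     */
    static double[] weightsFor(int outIndex, int srcSize, int dstSize) {
        double scale = (double) srcSize / dstSize;
        double filterScale = Math.max(scale, 1.0);
        double support = A * filterScale;
        double center = (outIndex + 0.5) * scale;
        int first = Math.max(0, (int) Math.floor(center - support));
        int last = Math.min(srcSize, (int) Math.ceil(center + support)); // exclusive
        double[] w = new double[last - first];
        double sum = 0;
        for (int i = 0; i < w.length; i++) {
            w[i] = lanczos((first + i + 0.5 - center) / filterScale);
            sum += w[i];
        }
        for (int i = 0; i < w.length; i++) w[i] /= sum; // weights sum to 1
        return w;
    }

    public static void main(String[] args) {
        double[] w = weightsFor(0, 1070, 32);
        System.out.println("source samples contributing to output 0: " + w.length);
    }
}
```

Changing any one of the assumed choices alters the resized pixel values slightly, which is enough to flip hash bits that sit close to the threshold.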