Image Hashing (pHash & PDQ)

Pure-Java, dependency-free implementations of two perceptual image hashing algorithms, used for finding visually similar images at scale in the netarchive:

PdqHasher — Meta's PDQ hash, a 256-bit perceptual hash.
PhashHasher — the classic DCT-based pHash, 64-bit, compatible with the phim Python/Rust reference implementation.

Both implementations were validated byte-for-byte against their respective reference implementations (Meta's C++ pdqhash and the National Library of Norway's phim) — see PdqHasherTest and PhashHasherTest.

Installation

Add the following dependency to your pom.xml:

<dependency>
    <groupId>io.github.netarchivesuite</groupId>
    <artifactId>image-hashes</artifactId>
    <version>1.1.1</version>
</dependency>

Known limitations: what these hashes can and can't detect

Both algorithms are designed to be robust to re-encoding (resizing, JPEG recompression, mild noise) but are not designed to be robust to geometric transforms (rotation, mirroring) or to aggressive cropping. The table below demonstrates this directly using a single source photo run through eleven transforms.

A Hamming distance close to 0 means "the algorithm considers these the same image." A large Hamming distance means "the algorithm does not consider these similar" — for PDQ, the conventional similarity threshold is ≤ 31 (out of 256 bits); for pHash, a commonly used threshold is ≤ 10 (out of 64 bits). Distances above those thresholds are shown in italics below to flag the cases where similarity detection is expected to fail.

Variant	pHash (hex)	pHash distance	PDQ (hex)	PDQ distance
Original	`d29499b7a706236b`	0	`7c744d5ce9c82d4ec65e6b138dd63b9699966644a360e1e1b165da19b6295c4b`	0
Downscaled to 25%	`d29499b7a706236b`	0	`7c74cd5c2948af4ec6536b138f963b9699d46644a360e1c1b16dda19f6295c4b`	16
Grayscale	`d29499b7a706236b`	0	`7c744d5ca9c8ad4ec65e6b138dd63b9699966644a360e1e1b165da19b6295c4b`	2
Rotated 45°	`664e98b93103ce7e`	*28*	`413c46332f43b966707e61cc6781470fcc7f6c7ba9c0b1ccb09d931927726e66`	*114*
Rotated 90°	`0781ecf0de5ce11b`	*32*	`3abe33aab57a7e70f78525bc1f66884723d08587803a307b78072c371f81fbe0`	*136*
Rotated 180°	`873ecd0cd2ac76c1`	*34*	`0961e7f63c9d07e4930bc1b9d803913cccc3cc6ef6354b4be43070b3637cf6a1`	*130*
Mirrored (horizontal flip)	`7a3e321d2dac89c3`	*30*	`29611809fc9dfa1b930f3e46da836ec3ccc33391f635b4b4e4308f4ce37c095e`	*128*
Visible random noise	`d29499b7a706236b`	0	`7c744d5ca9c82d4ec65e6b138dd63b9699966644a360e1e1b165da19f6295c4b`	2
Opaque caption bar	`d0949da78726276b`	6	`5c166d7da9ce2c58c65f6d708f963f9619966744e360e1e121659ab97c29940b`	*40*
Translucent watermark	`d29499a7a726236b`	2	`5c344d55e9c8ad5ec6576b138dd63b9699946644a360e1e1b165da19f6295c4b`	10
Heavy JPEG recompression (q=15)	`d29499b7a706236b`	0	`5c744d5ca9c8ad4ec65e6b138dd63b96999666c4a360e1e1b165da19b6295c4b`	4
Cropped 15% off each edge	`77dfdc91d1868484`	*32*	`6a580a90e615a4b4c684788c1a0dfaec7e25f6a17ee5399b19d9813b94fb52ee`	*130*
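
The distance columns can be reproduced from the hex strings alone: XOR the two hashes and count the differing bits. The following is a minimal sketch that uses only the JDK. The names `HammingSketch`, `hamming` and `isSimilar` are hypothetical and are not part of this library, which ships its own `hammingDistance` methods (see Usage).

```java
import java.math.BigInteger;

public class HammingSketch {

    // Hamming distance between two equal-length hex strings:
    // XOR the values and count the bits that differ.
    public static int hamming(String hexA, String hexB) {
        if (hexA.length() != hexB.length()) {
            throw new IllegalArgumentException("hashes must have the same length");
        }
        return new BigInteger(hexA, 16).xor(new BigInteger(hexB, 16)).bitCount();
    }

    // Thresholds quoted in this section: <= 10 for 64-bit pHash, <= 31 for 256-bit PDQ.
    public static boolean isSimilar(String hexA, String hexB, int threshold) {
        return hamming(hexA, hexB) <= threshold;
    }

    public static void main(String[] args) {
        String original  = "d29499b7a706236b"; // pHash, "Original" row
        String captioned = "d0949da78726276b"; // pHash, "Opaque caption bar" row
        System.out.println(hamming(original, captioned));       // 6
        System.out.println(isSimilar(original, captioned, 10)); // true
    }
}
```

Because the helper works on hex strings of any length, the same code covers both the 64-bit pHash and the 256-bit PDQ values in the table.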

Takeaways

Re-encoding transforms (downscale, grayscale, noise, JPEG recompression, translucent watermark) all stay comfortably under both similarity thresholds — these are the cases the hashes are designed for.
Geometric transforms (any rotation, mirroring) and aggressive cropping push both hashes well past their thresholds. This is expected: neither algorithm is rotation-, mirror-, or crop-invariant by design. Cropping deserves particular attention because, unlike rotation, it is a very common real-world transform in netarchive material (thumbnails, re-published excerpts), and a 15% crop is as undetectable here as a full 90° rotation (pHash distance 32 in both cases, PDQ distance 130 vs. 136).
An opaque caption bar costs more distance than a translucent watermark covering a similar area (pHash 6 vs. 2, PDQ 40 vs. 10), since the opaque version destroys the underlying pixel information entirely rather than just attenuating it. At 40, the caption bar already exceeds the PDQ threshold while staying under the pHash threshold.

See PdqHasherCatImageSetTest.java and PhashHasherCatImageSetTest.java for the full test suite that generated these numbers.

Performance comparison across implementations

Both hash algorithms were benchmarked against independent reference implementations on the same test image (1070×700 PNG) and on the same machine (13th Gen Intel(R) Core(TM) i9-13900K). The Java implementations were timed on a warmed-up JVM (JIT-compiled before timing), and the Python implementation was timed with time.perf_counter(). Performance is identical for JPEG images of the same dimensions. The relative ranking between implementations also stays consistent across image sizes: larger images are slower for all implementations, but the ratios between them remain the same, so the table below is representative regardless of the image resolution in your corpus.

Implementation	Algorithm	ms/image	Throughput
image-hashes (Java)	pdqHash	6.1 ms	163.6 img/sec
image-hashes (Java)	pdqHash (8 dihedral variants)	6.1 ms	163.9 img/sec
image-hashes (Java)	pHash	10.8 ms	93.0 img/sec
phim (Python/C++)	pdqHash (8 dihedral variants)	9.1 ms	110.2 img/sec
phim (Python/C++)	pHash	4.1 ms	242.1 img/sec
Meta official (Java)	pdqHash (naive)	16.3 ms	61.3 img/sec
Meta official (Java)	pdqHash (pre-allocated)	15.6 ms	63.9 img/sec
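
For readers who want to repeat the Java measurements, a warm-up-then-measure loop can be sketched as follows. This is not the harness that produced the table. `TimingSketch`, `averageMillis` and the warm-up count of 200 are assumptions made for illustration, and the workload is a placeholder standing in for a hash call.

```java
public class TimingSketch {

    // Average wall-clock milliseconds per call, measured after a warm-up phase
    // so that the JIT has had a chance to compile the hot path before timing starts.
    public static double averageMillis(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / timedRuns;
    }

    public static void main(String[] args) {
        // Placeholder workload; in a real benchmark this would be a hash call
        // such as PdqHasher.getHash(image) on an image loaded beforehand.
        double[] sink = new double[1];
        Runnable work = () -> {
            double acc = 0;
            for (int i = 1; i <= 100_000; i++) {
                acc += Math.sqrt(i);
            }
            sink[0] = acc; // keep the result alive so the loop is not optimised away
        };
        double ms = averageMillis(work, 200, 1000);
        System.out.printf("%.3f ms/call, %.1f calls/sec%n", ms, 1000.0 / ms);
    }
}
```

Loading the image outside the timed loop keeps decoding cost out of the per-hash figure. For publishable numbers a dedicated benchmarking tool is the safer choice.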

Notes

PDQ: this library is 2.6x faster than Meta's own official Java reference implementation (6.1 ms vs. 15.6 to 16.3 ms). The gap is explained by a single optimization: direct DataBufferByte pixel extraction, bypassing the per-pixel getRGB(x, y) call that Meta's implementation still uses. This library is also 1.5x faster than phim's C++-backed Python implementation.
PDQ dihedral variants: computing all 8 rotation/mirror variants via getAllDihedralHashes() costs essentially the same as computing a single hash — both pay the same dominant cost (pixel extraction, Jarosz filter, 2D DCT) exactly once. The 7 extra variants are derived cheaply from the same 16×16 2D DCT buffer without re-running the pipeline.
pHash: phim is 2.6x faster than this library. phim's Python pHash is backed by C++ and likely uses an FFT-based 2D DCT.
Meta's buffer pre-allocation: Meta's fromBufferedImage() API is designed for callers to pre-allocate and reuse scratch buffers across calls. In practice this makes almost no difference (16.3ms naive vs 15.6ms pre-allocated) because the dominant cost is the per-pixel getRGB() loop, not buffer allocation.
No official Meta/Facebook Java implementation of pHash exists. Unlike PDQ, pHash has no single canonical reference, so the table has no Meta row for pHash.
Per-image timings are averages over 1000 hash computations, with all implementations confirmed to produce byte-for-byte identical hashes before timing.
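
The difference between the two pixel-access styles named in the PDQ note can be sketched with standard `java.awt.image` classes. This is an illustration, not this library's or Meta's actual code. The class and method names are hypothetical, the luma weights are assumed, and the fast path handles only `TYPE_3BYTE_BGR` images that are not sub-images.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

public class LumaExtractionSketch {

    // Rec. 601-style luma weights, assumed here for illustration.
    static float luma(int r, int g, int b) {
        return 0.299f * r + 0.587f * g + 0.114f * b;
    }

    // Slow path: one getRGB(x, y) call per pixel.
    public static float[] lumaViaGetRgb(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        float[] out = new float[w * h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                out[y * w + x] = luma((rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF);
            }
        }
        return out;
    }

    // Fast path: read the backing byte array directly (B, G, R per pixel).
    public static float[] lumaViaDataBuffer(BufferedImage img) {
        if (img.getType() != BufferedImage.TYPE_3BYTE_BGR) {
            throw new IllegalArgumentException("sketch handles TYPE_3BYTE_BGR only");
        }
        byte[] data = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        float[] out = new float[img.getWidth() * img.getHeight()];
        for (int i = 0, p = 0; i < out.length; i++, p += 3) {
            int b = data[p] & 0xFF;
            int g = data[p + 1] & 0xFF;
            int r = data[p + 2] & 0xFF;
            out[i] = luma(r, g, b);
        }
        return out;
    }
}
```

Both methods return the same values for a `TYPE_3BYTE_BGR` image. The second avoids a method call and colour-model lookup per pixel, which is the kind of saving the note describes.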

Usage

Both hashers take a BufferedImage — load it however you normally would (from a file, an HTTP response, a WARC record, etc.) and pass it directly.

BufferedImage image = ImageIO.read(new File("photo.jpg"));

pHash

String hash = PhashHasher.getHash(image);
// hash1 and hash2 are two hashes previously returned by getHash()
int distance = PhashHasher.hammingDistance(hash1, hash2);

Hashes with a Hamming distance ≤ 10 (out of 64) are typically considered perceptually similar.

PDQ hash

Hash only:

String hash = PdqHasher.getHash(image);
// hash1 and hash2 are two hashes previously returned by getHash()
int distance = PdqHasher.hammingDistance(hash1, hash2);

Hash with quality score — useful for filtering out unreliable hashes from flat or low-detail images (quality ≤ 49 indicates a poor hash):

PdqHasher.Result result = PdqHasher.getHashAndQuality(image);
String hash    = result.hash;
int    quality = result.quality;

if (quality > 49) {
    // hash is reliable — store or compare it
}

All 8 dihedral variants (rotations and mirrors) — computed in a single pipeline pass at essentially no extra cost over a single hash:

String[] hashes = PdqHasher.getAllDihedralHashes(image);

// Variant names are available via PdqHasher.DIHEDRAL_NAMES:
// original, rotate90, rotate180, rotate270, flipX, flipY, flipPlus1, flipMinus1

// Check if a query image matches any orientation of a stored image:
int distance = PdqHasher.minHammingDistance(hashes, queryHash);

Hashes with a Hamming distance ≤ 31 (out of 256) are typically considered perceptually similar.
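
Conceptually, matching against any orientation is a minimum over the per-variant distances. The sketch below illustrates that idea with plain JDK code. `MinDistanceSketch` and its methods are hypothetical names, and this is not necessarily how `PdqHasher.minHammingDistance` is implemented. As stand-ins for a variant array, it uses the "Original" and "Rotated 90°" PDQ values from the limitations table, with the "Grayscale" value as the query.

```java
import java.math.BigInteger;

public class MinDistanceSketch {

    // Hamming distance between two hex-encoded hashes of equal length.
    public static int hamming(String hexA, String hexB) {
        return new BigInteger(hexA, 16).xor(new BigInteger(hexB, 16)).bitCount();
    }

    // Smallest distance between the query and any of the stored variants.
    public static int minHamming(String[] variants, String query) {
        int best = Integer.MAX_VALUE;
        for (String variant : variants) {
            best = Math.min(best, hamming(variant, query));
        }
        return best;
    }

    public static void main(String[] args) {
        String[] stored = {
            "7c744d5ce9c82d4ec65e6b138dd63b9699966644a360e1e1b165da19b6295c4b", // "Original" row
            "3abe33aab57a7e70f78525bc1f66884723d08587803a307b78072c371f81fbe0"  // "Rotated 90°" row
        };
        String query = "7c744d5ca9c8ad4ec65e6b138dd63b9699966644a360e1e1b165da19b6295c4b"; // "Grayscale" row
        int d = minHamming(stored, query);
        System.out.println(d + (d <= 31 ? " -> similar" : " -> not similar")); // 2 -> similar
    }
}
```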

Lanczos resampling compatibility

The pHash algorithm resizes the source image to 32×32 pixels using a Lanczos filter before computing the 2D DCT. Lanczos is not a single deterministic standard: different libraries make different implementation choices and produce different pixel values, so the same source image yields incompatible hashes. This library produces hashes that are byte-for-byte identical to those of phim, which uses Pillow's built-in Lanczos filter (implemented in C) for downscaling. The following libraries produce different, incompatible hashes: OpenCV, TwelveMonkeys and scikit-image.
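
To show where such choices can enter, here is a sketch of the textbook Lanczos kernel and of the per-pixel weights for a downscale. Which choices actually differ between the libraries listed is not stated here. Typical candidates are the window size, whether the kernel is stretched by the scale factor, edge handling and rounding. The sketch assumes a window size of 3 and a stretched kernel, and it is neither Pillow's nor this library's exact code. All names are hypothetical.

```java
public class LanczosKernelSketch {

    // Lanczos window of size a: sinc(x) * sinc(x / a) for |x| < a, else 0,
    // where sinc(x) = sin(pi x) / (pi x) and sinc(0) = 1.
    public static double lanczos(double x, int a) {
        if (x == 0.0) return 1.0;
        if (Math.abs(x) >= a) return 0.0;
        double px = Math.PI * x;
        return a * Math.sin(px) * Math.sin(px / a) / (px * px);
    }

    // Weights for one output pixel centred at `center` (source coordinates)
    // when shrinking by `scale` (> 1). The kernel is stretched by `scale` so
    // that every contributing source pixel is covered, then normalised to sum to 1.
    public static double[] weights(double center, double scale, int a, int srcLen) {
        double support = a * scale;
        int lo = Math.max(0, (int) Math.floor(center - support));
        int hi = Math.min(srcLen - 1, (int) Math.ceil(center + support));
        double[] w = new double[hi - lo + 1];
        double sum = 0;
        for (int i = lo; i <= hi; i++) {
            w[i - lo] = lanczos((i + 0.5 - center) / scale, a);
            sum += w[i - lo];
        }
        for (int i = 0; i < w.length; i++) {
            w[i] /= sum;
        }
        return w;
    }
}
```

Changing any of these details shifts the resampled pixel values slightly, which is enough to flip bits in the resulting hash.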
