HyDE-Replication

Empirical Evaluation of Hypothetical Document Embeddings (HyDE) in the Financial Domain

Abstract

This repository contains a replication and sensitivity analysis of the Hypothetical Document Embeddings (HyDE) retrieval method. The original work (Gao et al., 2022) reports significant zero-shot improvements on general web-search benchmarks, but our evaluation on the specialized FiQA (Financial Q&A) dataset shows the opposite effect. With a 250M-parameter instruction-tuned generator (google/flan-t5-base), standard asymmetric retrieval (86.0% Hit@10) outperformed the best HyDE configuration (74.0% Hit@10) by 12 percentage points. Our analysis identifies Stylistic Mismatch and Semantic Drift as the primary drivers of this retrieval degradation in a domain-specific corpus.

Experimental Results (Hit@10)

Method	Configuration	Hit@10
Baseline	Standard Bi-Encoder Search	86.0%
HyDE	Forum-Style Prompting	74.0%
HyDE	Encyclopedia-Style Prompting	68.0%
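
For reference, Hit@10 is read here as the fraction of queries for which at least one relevant document appears among the top 10 retrieved results. The sketch below illustrates that definition on made-up data; it is not the notebook's code, and `hit_at_k`, `ranked` and `qrels` are hypothetical names.

```python
from typing import Dict, List, Set


def hit_at_k(ranked: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Fraction of queries with at least one relevant document in the top-k results."""
    if not ranked:
        return 0.0
    hits = 0
    for query_id, doc_ids in ranked.items():
        relevant = qrels.get(query_id, set())
        # A query counts as a hit if any of its top-k documents is labeled relevant.
        if any(doc_id in relevant for doc_id in doc_ids[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data (made up for illustration): q1 is a hit, q2 is a miss.
ranked = {"q1": ["d3", "d7", "d1"], "q2": ["d4", "d5", "d6"]}
qrels = {"q1": {"d1"}, "q2": {"d9"}}
score = hit_at_k(ranked, qrels, k=10)
print(score)
```

Under this definition, a query whose labeled documents are missing from the index can never be a hit, which is why the label synchronization step described below matters.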

Methodological Obstacles and Solutions

1. Dataset Infrastructure Migration

Issue: Legacy BEIR (Benchmarking IR) data loaders rely on custom Python loading scripts (.py) for data ingestion. As of 2024/2025, the Hugging Face datasets library no longer executes such scripts for security reasons, so loading these datasets fails with a RuntimeError.
Mitigation: The ingestion layer was refactored to read from the Parquet-based MTEB (Massive Text Embedding Benchmark) repository (mteb/fiqa), giving a script-free, reproducible data pipeline.
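
A minimal sketch of what script-free loading could look like with the `datasets` library. The config names (`corpus`, `queries`, `default`), the split names and the column names (`query-id`, `corpus-id`, `score`) are assumptions about the layout of mteb/fiqa and should be checked against the dataset card; `load_fiqa` and `rows_to_qrels` are hypothetical helper names.

```python
from typing import Dict, Iterable, Mapping, Set


def rows_to_qrels(rows: Iterable[Mapping]) -> Dict[str, Set[str]]:
    """Group relevance rows into {query_id: {relevant_doc_id, ...}}."""
    qrels: Dict[str, Set[str]] = {}
    for row in rows:
        if row["score"] > 0:  # keep only positive relevance judgments
            qrels.setdefault(str(row["query-id"]), set()).add(str(row["corpus-id"]))
    return qrels


def load_fiqa():
    """Load FiQA from the Parquet-backed mteb/fiqa repository (needs network access).

    Assumed layout: configs "corpus", "queries" and "default" (the qrels).
    """
    from datasets import load_dataset  # imported lazily; only needed when loading

    corpus = load_dataset("mteb/fiqa", "corpus", split="corpus")
    queries = load_dataset("mteb/fiqa", "queries", split="queries")
    qrels = rows_to_qrels(load_dataset("mteb/fiqa", "default", split="test"))
    return corpus, queries, qrels
```

Because the repository stores plain Parquet files, no dataset script is executed and no `trust_remote_code` flag is involved.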

2. Label Synchronization and Validation

Issue: In FiQA, many entries in the query split have no relevance labels. Random subsetting frequently produced queries whose ground-truth documents were absent from the search index, which drove the measured retrieval metrics toward zero.
Mitigation: Implemented a Ground-Truth-First Filter. The evaluation loop now synchronizes the test queries with the qrels (relevance labels) and the corpus subset, so that every tested query has at least one verifiable target in the index.
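
One way to read the Ground-Truth-First Filter is sketched below: select labeled queries first, seed the corpus subset with their labeled documents, then pad with random distractors. This is an illustration under that assumption, not the notebook's implementation; the function name and parameters are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple


def ground_truth_first_subset(
    query_ids: List[str],
    qrels: Dict[str, Set[str]],
    corpus_ids: List[str],
    n_queries: int,
    n_distractors: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Pick labeled queries first, then build a corpus subset containing their targets."""
    rng = random.Random(seed)
    corpus_set = set(corpus_ids)
    # 1. Keep only queries with at least one labeled document present in the corpus.
    labeled = [q for q in query_ids if qrels.get(q, set()) & corpus_set]
    chosen = rng.sample(labeled, min(n_queries, len(labeled)))
    # 2. Seed the corpus subset with every ground-truth document of the chosen queries.
    targets: Set[str] = set()
    for q in chosen:
        targets |= qrels[q] & corpus_set
    # 3. Pad with randomly sampled distractor documents.
    others = [d for d in corpus_ids if d not in targets]
    distractors = rng.sample(others, min(n_distractors, len(others)))
    return chosen, sorted(targets) + distractors
```

With this ordering, every evaluated query is guaranteed a reachable target, so a miss reflects the retriever rather than the sampling.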

3. GPU Parallelization for Inference

Issue: Generating hypothetical documents one query at a time for 100 queries introduced significant latency and triggered sequential-execution warnings on the T4 GPU.
Mitigation: Prompts were batched (batch size = 8) so that the GPU processes several sequences per forward pass, reducing generation latency by approximately 65%.
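
The batching step can be sketched as follows. The batch size of 8 and the model name come from this README; `max_new_tokens=64`, the helper names and the exact tokenizer settings are assumptions, not values taken from the notebook.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def generate_batched(
    prompts: Sequence[str],
    generate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Run generate_fn once per batch and return outputs in prompt order."""
    outputs: List[str] = []
    for batch in batched(prompts, batch_size):
        outputs.extend(generate_fn(batch))
    return outputs


def make_flan_t5_generate_fn(model_name: str = "google/flan-t5-base", max_new_tokens: int = 64):
    """Build a batch generate_fn backed by transformers (downloads the model on first use)."""
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device).eval()

    def generate_fn(batch: List[str]) -> List[str]:
        # Pad to the longest prompt so the batch forms one rectangular tensor.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    return generate_fn
```

For 100 prompts and a batch size of 8, `generate_batched` makes 13 calls to the model instead of 100; the actual speedup depends on prompt lengths and hardware.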

Analysis of Findings

The Style Alignment Hypothesis

A significant finding of this replication is the impact of stylistic register on dense retrieval. The FiQA corpus consists of informal financial forum posts.

Observation: Prompting the LLM to generate formal "Encyclopedia" style responses resulted in the lowest retrieval accuracy (68.0% Hit@10, versus 74.0% for forum-style prompting).
Implication: Dense retrievers are sensitive to the register of the text they embed. For HyDE to be effective, the generative model should write in the style of the target corpus, although in this experiment style matching alone did not close the gap to the baseline (86.0%).
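
To make the two prompting styles concrete, the sketch below shows how style-specific HyDE prompts could be built. The template wording is hypothetical and written for illustration only; the prompts used in the notebook may differ.

```python
PROMPT_TEMPLATES = {
    # Hypothetical wording; not copied from the notebook.
    "forum": (
        "Write a short, informal forum reply that answers this personal finance question.\n"
        "Question: {question}\n"
        "Reply:"
    ),
    "encyclopedia": (
        "Write a formal encyclopedia-style passage that answers this finance question.\n"
        "Question: {question}\n"
        "Passage:"
    ),
}


def build_hyde_prompt(question: str, style: str = "forum") -> str:
    """Fill the template for the requested style with the user's question."""
    if style not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown style: {style!r}")
    return PROMPT_TEMPLATES[style].format(question=question.strip())


print(build_hyde_prompt("Should I pay off my mortgage early?", style="forum"))
```

Only the instruction changes between styles; the question and the downstream embedding step stay the same, which isolates the effect of register on retrieval.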

Semantic Drift in Low-Parameter Models

t-SNE analysis of the embeddings suggests that the "hallucinated" answers from a 250M-parameter model often introduce extraneous terminology not present in the target documents. This acts as semantic noise, increasing the Euclidean distance between the hypothetical embedding and the ground-truth document embedding. In specialized domains, a direct asymmetric search with the raw query can therefore provide a cleaner signal than a noisy expansion.
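
The effect of off-topic terminology on L2 distance can be illustrated with toy vectors. The sketch assumes unit-normalized embeddings, for which the squared L2 distance equals 2 - 2·cos(a, b); the three-dimensional vectors are invented for illustration and are not real model outputs.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Axis 0 stands for the topic of the target document; axis 2 for unrelated terminology.
target = normalize([1.0, 0.0, 0.0])
clean = normalize([0.9, 0.1, 0.0])   # hypothetical answer that stays on topic
noisy = normalize([0.9, 0.1, 0.8])   # same answer plus an off-topic component

d_clean = l2_distance(clean, target)
d_noisy = l2_distance(noisy, target)
print(f"clean: {d_clean:.3f}  noisy: {d_noisy:.3f}")
```

The off-topic component leaves the on-topic content unchanged but still moves the normalized vector away from the target, which is the mechanism the paragraph above describes.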

Technical Stack

Retriever: all-MiniLM-L6-v2 (384-dimensional dense vectors)
Generator (HyDE): google/flan-t5-base (250M parameters)
Vector Search: FAISS (L2 Distance)
Data Source: mteb/fiqa
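
A rough sketch of how these components fit together is given below. The retrieval logic takes the generator, encoder and index as plain callables so it can be read independently of any library; `build_faiss_search` shows one possible wiring with sentence-transformers and FAISS. All function names are hypothetical and this is not the notebook's code.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def baseline_search(query: str, embed: Callable[[str], Vector],
                    search: Callable[[Vector, int], List[str]], k: int = 10) -> List[str]:
    """Standard bi-encoder search: embed the raw query and look up neighbors."""
    return search(embed(query), k)


def hyde_search(query: str, generate: Callable[[str], str], embed: Callable[[str], Vector],
                search: Callable[[Vector, int], List[str]], k: int = 10) -> List[str]:
    """HyDE: embed a generated hypothetical answer instead of the query."""
    hypothetical_doc = generate(query)   # 1. the LLM writes a hypothetical answer
    vector = embed(hypothetical_doc)     # 2. encode it with the retriever
    return search(vector, k)             # 3. nearest-neighbor lookup in the index


def build_faiss_search(doc_ids: Sequence[str], doc_texts: Sequence[str],
                       model_name: str = "all-MiniLM-L6-v2"):
    """Return (embed, search) callables backed by sentence-transformers and FAISS."""
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer(model_name)
    doc_vecs = np.asarray(encoder.encode(list(doc_texts)), dtype="float32")
    index = faiss.IndexFlatL2(doc_vecs.shape[1])  # exact L2 search
    index.add(doc_vecs)

    def embed(text: str) -> Vector:
        return encoder.encode([text])[0].tolist()

    def search(vector: Vector, k: int) -> List[str]:
        _, idx = index.search(np.asarray([vector], dtype="float32"), k)
        return [doc_ids[i] for i in idx[0] if i != -1]  # -1 marks empty slots

    return embed, search
```

Keeping `generate`, `embed` and `search` injectable means the baseline and HyDE paths share the same encoder and index, so any difference in Hit@10 comes from the generated text alone.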

Research Repository Structure

├── report/ # Detailed research paper in LaTeX format
├── Sensitivity_Analysis_of_Hypothetical_Document_Embeddings_(HyDE)_in_Financial_Domain_Retrieval.ipynb # Documented Colab replication script
└── README.md                    # Technical summary

References

Gao, L., et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv:2212.10496.
Muennighoff, N., et al. (2022). MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
