This repository contains a technical replication and sensitivity analysis of the Hypothetical Document Embeddings (HyDE) retrieval method. The original framework (Gao et al., 2022) reports significant zero-shot improvements on general web-search benchmarks, but our evaluation on the specialized FiQA (Financial Q&A) dataset shows HyDE degrading retrieval rather than improving it. With a 250M-parameter instruction-tuned generator, standard asymmetric retrieval (86.0% Hit@10) outperformed the best HyDE configuration (74.0% Hit@10). Our analysis identifies Stylistic Mismatch and Semantic Drift as the primary drivers of this degradation on a domain-specific corpus.
| Method | Retrieval Strategy | Hit@10 |
|---|---|---|
| Baseline | Standard Bi-Encoder Search | 86.0% |
| HyDE | Forum-Style Prompting | 74.0% |
| HyDE | Encyclopedia-Style Prompting | 68.0% |
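
The percentages in the table are Hit@10 scores. A minimal sketch of that metric, assuming Hit@k counts a query as a success when at least one labeled relevant document appears in its top k results. The function and argument names are illustrative and not taken from the notebook.

```python
from typing import Dict, List, Set


def hit_at_k(
    ranked: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant document in the top k.

    ranked: query id -> document ids ordered from best to worst match.
    qrels:  query id -> set of relevant document ids.
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for query_id, doc_ids in ranked.items()
        if qrels.get(query_id, set()) & set(doc_ids[:k])
    )
    return hits / len(ranked)
```

Under this definition, a score of 86.0% over 100 queries would mean that 86 queries had a relevant document somewhere in their top 10.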
Issue: Legacy BEIR (Benchmarking IR) data loaders relied on custom Python loading scripts (`.py`) for data ingestion. As of 2024/2025, the Hugging Face `datasets` library no longer executes these scripts for security reasons, so loading fails with a `RuntimeError` at initialization.
Mitigation: The ingestion layer was refactored to read from the MTEB (Massive Text Embedding Benchmark) Parquet-based repository, which gives a script-free and reproducible data pipeline.
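A script-free load might look like the sketch below. The config names (`corpus`, `queries`, `default`), the split names, and the qrels column names (`query-id`, `corpus-id`, `score`) are assumptions about the `mteb/fiqa` layout and should be checked against the dataset card. `load_dataset` is imported lazily so the pure-Python helper can be used without the library installed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set


def load_fiqa_tables(repo: str = "mteb/fiqa"):
    """Load corpus, queries and test qrels from Parquet files.

    Config and split names are assumptions, not confirmed by the source.
    """
    from datasets import load_dataset  # requires the `datasets` package

    corpus = load_dataset(repo, "corpus", split="corpus")
    queries = load_dataset(repo, "queries", split="queries")
    qrels = load_dataset(repo, "default", split="test")
    return corpus, queries, qrels


def qrels_to_dict(rows: Iterable[Mapping]) -> Dict[str, Set[str]]:
    """Group positive relevance labels by query id."""
    relevant = defaultdict(set)
    for row in rows:
        if row["score"] > 0:  # keep only positively labeled pairs
            relevant[str(row["query-id"])].add(str(row["corpus-id"]))
    return dict(relevant)
```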
Issue: In datasets like FiQA, the query split contains a high volume of unlabeled entries. Random subsetting frequently produced queries whose ground-truth documents were missing from the search index, driving recall metrics toward zero.
Mitigation: Implemented a Ground-Truth-First Filter. The evaluation loop was re-engineered to synchronize the test queries with the qrels (relevance labels) and the corpus subset, so that every tested query has at least one labeled relevant document in the index.
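One way to realize such a filter is sketched below, assuming queries and corpus are dictionaries keyed by id and qrels maps each query id to a set of relevant document ids. The function name and the `n_distractors` parameter are hypothetical; the notebook may build its subset differently.

```python
import random
from typing import Dict, Set, Tuple


def ground_truth_first_subset(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    qrels: Dict[str, Set[str]],
    n_queries: int,
    n_distractors: int,
    seed: int = 0,
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, Set[str]]]:
    """Sample queries first, then build a corpus that contains their targets."""
    rng = random.Random(seed)

    # 1. Keep only queries with at least one labeled document in the corpus.
    labeled = sorted(
        qid for qid in queries
        if any(doc_id in corpus for doc_id in qrels.get(qid, ()))
    )
    chosen = rng.sample(labeled, min(n_queries, len(labeled)))

    # 2. Restrict the labels to the chosen queries and available documents.
    sub_qrels = {
        qid: {doc_id for doc_id in qrels[qid] if doc_id in corpus}
        for qid in chosen
    }

    # 3. Every ground-truth document goes into the index before any filler.
    required = set().union(*sub_qrels.values())
    pool = sorted(doc_id for doc_id in corpus if doc_id not in required)
    distractors = rng.sample(pool, min(n_distractors, len(pool)))

    sub_corpus = {doc_id: corpus[doc_id] for doc_id in sorted(required) + distractors}
    sub_queries = {qid: queries[qid] for qid in chosen}
    return sub_queries, sub_corpus, sub_qrels
```

Sampling the queries before the corpus is the design choice that matters here: the index is built around the labels, so a miss reflects the retriever rather than a missing document.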
Issue: Sequential LLM inference for 100 queries introduced significant latency and triggered sequential-execution warnings on the T4 GPU.
Mitigation: Implemented batched generation (batch size = 8), so the GPU processes several prompts per forward pass instead of one. This reduced generation latency by approximately 65%.
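A minimal sketch of batched generation follows. The helper names and `max_new_tokens=64` are assumptions for illustration; only the batch size of 8 and the `google/flan-t5-base` generator come from this README. The model-loading function imports `torch` and `transformers` lazily, so the batching helpers themselves have no dependencies.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def generate_in_batches(
    prompts: Sequence[str],
    generate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Run generate_batch over the prompts batch by batch, preserving order."""
    outputs: List[str] = []
    for batch in batched(prompts, batch_size):
        outputs.extend(generate_batch(batch))
    return outputs


def make_flan_t5_generator(
    model_name: str = "google/flan-t5-base",
    max_new_tokens: int = 64,  # assumed value, not stated in the source
) -> Callable[[List[str]], List[str]]:
    """Build a generate_batch callable (needs torch and transformers)."""
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device).eval()

    def generate_batch(prompts: List[str]) -> List[str]:
        # Pad to the longest prompt so the batch forms one tensor.
        inputs = tokenizer(
            prompts, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    return generate_batch
```

With 100 prompts and a batch size of 8, `generate_in_batches` makes 13 calls to the model (12 full batches and one of 4) instead of 100.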
A significant finding of this replication is the impact of stylistic register on dense retrieval. The FiQA corpus consists of informal financial forum posts.
- Observation: Prompting the LLM to generate formal "Encyclopedia" style responses resulted in the lowest retrieval accuracy (68.0% Hit@10), below forum-style prompting (74.0%).
- Implication: Dense retrievers are highly sensitive to the register and dialect of the text. For HyDE to be effective, the generative model must mimic the style and register of the target corpus.
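The two prompting styles can be expressed as templates, as in the sketch below. The prompt wording is hypothetical: the README names the styles but does not give the exact prompts used in the notebook.

```python
# Hypothetical prompt templates; the actual wording in the notebook may differ.
PROMPT_TEMPLATES = {
    "forum": (
        "Write a short, informal forum reply that answers this "
        "personal finance question.\nQuestion: {query}\nReply:"
    ),
    "encyclopedia": (
        "Write a formal encyclopedia passage that answers this "
        "finance question.\nQuestion: {query}\nPassage:"
    ),
}


def build_hyde_prompt(query: str, style: str = "forum") -> str:
    """Fill the template for the requested style with the user query."""
    try:
        template = PROMPT_TEMPLATES[style]
    except KeyError:
        raise ValueError(
            f"unknown style {style!r}; expected one of {sorted(PROMPT_TEMPLATES)}"
        ) from None
    return template.format(query=query.strip())
```

Keeping the style as a parameter makes the register the only thing that changes between runs, which is what a sensitivity comparison of this kind needs.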
t-SNE analysis of the embeddings suggests that the "hallucinated" answers from a 250M-parameter model often introduce extraneous terminology not present in the target documents. This acts as Semantic Noise, increasing the Euclidean distance between the hypothetical embedding and the ground truth. In specialized domains, a direct asymmetric search with the raw query often provides a cleaner signal than a noisy expansion.
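The geometric effect can be illustrated with toy vectors. The sketch below uses made-up 3-dimensional unit vectors, not real embeddings, and assumes embeddings are L2-normalized, in which case squared Euclidean distance equals 2 - 2·cos. Mixing an off-topic direction into an otherwise close vector moves it away from the target.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def add_noise(v: Sequence[float], noise: Sequence[float], weight: float) -> List[float]:
    """Mix an off-topic direction into v and re-normalize."""
    return normalize([x + weight * n for x, n in zip(v, noise)])


# Toy vectors, chosen only for illustration.
target = normalize([1.0, 0.0, 0.0])        # ground-truth document
clean = normalize([0.9, 0.1, 0.0])         # on-topic embedding
off_topic = [0.0, 0.0, 1.0]                # extraneous terminology
drifted = add_noise(clean, off_topic, 0.8)  # embedding after semantic noise

clean_gap = l2_distance(clean, target)
drifted_gap = l2_distance(drifted, target)
```

In this toy setup `drifted_gap` is larger than `clean_gap`, which mirrors the claim above: extra content unrelated to the target pushes the hypothetical embedding further from it under L2 search.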
- Retriever: `all-MiniLM-L6-v2` (384-dimensional dense vectors)
- Generator (HyDE): `google/flan-t5-base` (250M parameters)
- Vector Search: FAISS (L2 distance)
- Data Source: `mteb/fiqa`
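How these components fit together can be sketched with the model calls passed in as plain callables. Everything here is an illustrative assumption: the function names are hypothetical, the brute-force `l2_top_k` is a pure-Python stand-in for an exact FAISS L2 index, and a single hypothetical document per query is assumed.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def l2_top_k(query_vec: Vector, doc_vecs: Dict[str, Vector], k: int = 10) -> List[str]:
    """Return the ids of the k documents closest to query_vec in L2 distance."""
    def squared_l2(doc_id: str) -> float:
        return sum((q - d) ** 2 for q, d in zip(query_vec, doc_vecs[doc_id]))

    return sorted(doc_vecs, key=squared_l2)[:k]


def baseline_search(
    query: str,
    encode: Callable[[str], Vector],
    doc_vecs: Dict[str, Vector],
    k: int = 10,
) -> List[str]:
    """Standard bi-encoder search: embed the raw query and search."""
    return l2_top_k(encode(query), doc_vecs, k)


def hyde_search(
    query: str,
    build_prompt: Callable[[str], str],
    generate: Callable[[str], str],
    encode: Callable[[str], Vector],
    doc_vecs: Dict[str, Vector],
    k: int = 10,
) -> List[str]:
    """HyDE search: embed a generated hypothetical answer instead of the query."""
    hypothetical_doc = generate(build_prompt(query))
    return l2_top_k(encode(hypothetical_doc), doc_vecs, k)
```

In the setup listed above, `encode` would wrap the `all-MiniLM-L6-v2` encoder and `generate` would wrap `google/flan-t5-base`. The only difference between the two paths is which text gets embedded, so any gap in Hit@10 comes from the generated text.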
├── report/ # Detailed research paper in LaTeX format
├── Sensitivity_Analysis_of_Hypothetical_Document_Embeddings_(HyDE)_in_Financial_Domain_Retrieval.ipynb # Documented Colab replication script
└── README.md # Technical summary
- Gao, L., et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv:2212.10496.
- Muennighoff, N., et al. (2022). MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.