RAG using LangChain: Part 5 - Hypothetical Document Embeddings (HyDE)

Jayant Pal
7 min read · Jun 24, 2024


Photo by Ramón Salinero on Unsplash

Retrievers

In the previous articles, we discussed various retrievers that can be used in a RAG pipeline. During retrieval, we compare the query embedding with the document embeddings and retrieve the most similar documents. Essentially, we are matching the query with the content that can answer it, so it is a question-answer comparison.
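To make the comparison concrete, here is a minimal sketch of similarity-based retrieval. The three-dimensional vectors and document ids are made up for illustration, and a real pipeline would get its vectors from an embedding model and delegate the search to a vector store.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made "embeddings" (assumed values, purely for illustration)
doc_vecs = {
    "jobs_leaves_apple": [0.9, 0.1, 0.0],
    "pixar_history": [0.1, 0.9, 0.1],
    "ipod_launch": [0.2, 0.1, 0.9],
}
query_vec = [0.8, 0.3, 0.1]

print(top_k(query_vec, doc_vecs, k=2))
```

The retriever simply ranks documents by how close their vectors are to the query vector, which is why the quality of the query embedding matters so much.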

Several issues can occur during the retrieval process:

  • The document base may be too big, so the retrieved documents might not contain the important information.
  • If the retriever itself is not good, the LLM will either hallucinate or fail to answer the user query.
  • Since we are comparing a question to an answer, the query embedding alone is sometimes not sufficient to retrieve the correct documents.

So, to make retrieval more effective, we want an answer-answer comparison instead of a question-answer comparison. This is where Hypothetical Document Embeddings can be used.

What are Hypothetical Document Embeddings (HyDE)?

HyDE uses an LLM to generate a “fake” hypothetical document for a given user query. It then embeds this document and uses the embedding to look up real documents that are similar to the hypothetical one.

The underlying concept here is that the hypothetical document may be closer to the real documents in the embedding space than the query.

Source: https://wfhbrian.com/revolutionizing-search-how-hypothetical-document-embeddings-hyde-can-save-time-and-increase-productivity/
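The intuition can be illustrated with a toy sketch. It assumes a bag-of-words "embedding" and a canned `fake_llm` function, both hypothetical stand-ins for a real embedding model and a real LLM call, so word overlap here is only a crude proxy for semantic similarity.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(query):
    """Hypothetical stand-in for an LLM call; returns a canned answer-like passage."""
    return ("Steve Jobs departed Apple in 1985 after a power struggle "
            "with the board and CEO John Sculley.")

real_chunk = ("In 1985, Jobs departed Apple after a long power struggle with "
              "the company's board and its then-CEO, John Sculley.")
query = "When was Steve Jobs fired from Apple?"

# Compare the raw question and the generated "answer" against the real chunk
query_score = cosine(embed(query), embed(real_chunk))
hyde_score = cosine(embed(fake_llm(query)), embed(real_chunk))
print(f"query vs chunk: {query_score:.2f}")
print(f"hypothetical doc vs chunk: {hyde_score:.2f}")
```

In this toy setup the answer-shaped text lands closer to the real chunk than the question does, because it is phrased like the chunk itself.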

HyDE in practice using LangChain

Import all necessary dependencies.

from langchain.chat_models import ChatOpenAI
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma

Set up the LLM, load the documents using WikipediaLoader, and split them using RecursiveCharacterTextSplitter.

import os

with open('../../openai_api_key.txt') as f:
    api_key = f.read().strip()  # strip the trailing newline, if any
os.environ['OPENAI_API_KEY'] = api_key

chat = ChatOpenAI()

chunk_size = 300
chunk_overlap = 100

# loading data
loader = WikipediaLoader(query="Steve Jobs", load_max_docs=5)
documents = loader.load()

# text splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap = chunk_overlap)
docs = text_splitter.split_documents(documents=documents)
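To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that uses a plain fixed-size sliding window over characters. This is not the algorithm RecursiveCharacterTextSplitter uses (it splits on separators recursively). The `sliding_window_chunks` helper is hypothetical and only illustrates how neighbouring chunks share text.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 700 characters of dummy text
text = "".join(str(i % 10) for i in range(700))
chunks = sliding_window_chunks(text)
print([len(c) for c in chunks])
# The last 100 characters of one chunk are the first 100 of the next
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap keeps a sentence that straddles a chunk boundary available in full in at least one chunk.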

Let’s set up the embedding model

embedding_function = HuggingFaceBgeEmbeddings(
    model_name = "BAAI/bge-small-en-v1.5",
    model_kwargs = {'device':'cpu'},
    encode_kwargs = {'normalize_embeddings':True}
)

Now let’s create a vector store and set up the base retriever using it.

# creating vector store
db = Chroma.from_documents(documents = docs,embedding=embedding_function,persist_directory = "output/steve_jobs_for_hyde.db")

# create the retriever
base_retriever = db.as_retriever(search_kwargs = {"k":5})

Creating a prompt template for generating the hypothetical document

from langchain.prompts.chat import SystemMessagePromptTemplate, ChatPromptTemplate

def get_hypo_doc(query):
    # Prompt that asks the LLM to write an answer-like passage for the query
    template = """Imagine you are an expert writing a detailed explanation on the topic: '{query}'
Your response should be comprehensive and include all key points that would be found in the top search result."""

    system_message_prompt = SystemMessagePromptTemplate.from_template(template = template)
    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt])
    messages = chat_prompt.format_prompt(query = query).to_messages()
    response = chat(messages = messages)
    hypo_doc = response.content
    return hypo_doc

Generating the hypothetical document

query = 'When was Steve Jobs fired from Apple?'
print(get_hypo_doc(query=query))


# OUTPUT

Steve Jobs was fired from Apple on September 17, 1985. The decision to remove Jobs from his position as the head of the Macintosh division was made by then-CEO John Sculley, who had been brought in by Jobs himself to help run the company. Jobs' management style and clashes with other executives led to his ousting from the company he co-founded.
Following his departure from Apple, Jobs went on to found NeXT Inc., a computer platform development company, and later acquired The Graphics Group, which would eventually become Pixar Animation Studios. Jobs returned to Apple in 1997 when the company acquired NeXT, and he eventually became CEO once again, leading the company to become one of the most successful tech companies in the world.
The date of Steve Jobs' firing from Apple, September 17, 1985, is seen as a pivotal moment in his career and the history of Apple Inc. It marked a period of struggle for both Jobs and the company, but ultimately led to Jobs' growth as a leader and his eventual triumphant return to Apple.

We will now use this hypothetical answer, instead of the raw query, to retrieve the relevant documents.

matched_doc = base_retriever.get_relevant_documents(query = get_hypo_doc(query))
print(matched_doc)

# OUTPUT
[Document(page_content="In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets,", metadata={'source': 'https://en.wikipedia.org/wiki/Steve_Jobs', 'summary': 'Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology company Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed College in 1972 before withdrawing that same year. In 1974, he traveled through India, seeking enlightenment before later studying Zen Buddhism. He and Wozniak co-founded Apple in 1976 to further develop and sell Wozniak\'s Apple I personal computer. Together, the duo gained fame and wealth a year later with production and sale of the Apple II, one of the first highly successful mass-produced microcomputers. Jobs saw the commercial potential of the Xerox Alto in 1979, which was mouse-driven and had a graphical user interface (GUI). This led to the development of the unsuccessful Apple Lisa in 1983, followed by the breakthrough Macintosh in 1984, the first mass-produced computer with a GUI. The Macintosh launched the desktop publishing industry in 1985 with the addition of the Apple LaserWriter, the first laser printer to feature vector graphics and PostScript.\nIn 1985, Jobs departed Apple after a long power struggle with the company\'s board and its then-CEO, John Sculley. 
That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer graphics division of Lucasfilm that eventually spun off independently as Pixar, which produced the first 3D computer-animated feature film Toy Story (1995) and became a leading animation studio, producing over 27 films since.\nIn 1997, Jobs returned to Apple as CEO after the company\'s acquisition of NeXT. He was largely responsible for reviving Apple, which was on the verge of bankruptcy. He worked closely with British designer Jony Ive to develop a line of products and services that had larger cultural ramifications, beginning with the "Think different" advertising campaign, and leading to the iMac, iTunes, Mac OS X, Apple Store, iPod, iTunes Store, iPhone, App Store, and iPad. In 2003, Jobs was diagnosed with a pancreatic neuroendocrine tumor. He died of respiratory arrest related to the tumor in 2011; in 2022, he was posthumously awarded the Presidential Medal of Freedom.', 'title': 'Steve Jobs'})]

We can see that, using this hypothetical answer, the retriever returns the chunk describing Jobs' departure from Apple in 1985, which gives the LLM the context it needs to answer the user query without hallucinating. All the steps we performed so far are a manual way of creating hypothetical embeddings: we define a prompt template to generate a hypothetical answer and then perform a similarity search between that answer and the document chunks.

We can also create Hypothetical Documents using LangChain’s predefined functions.

HyDE from Chains

from langchain.chains import HypotheticalDocumentEmbedder

hyde_embedding_function = HypotheticalDocumentEmbedder.from_llm(llm = chat, base_embeddings = embedding_function, prompt_key = 'web_search' )

The HypotheticalDocumentEmbedder class takes care of creating hypothetical answers and embedding them; the vector store then uses these embeddings to retrieve similar chunks.

Default prompts: [‘web_search’, ‘sci_fact’, ‘arguana’, ‘trec_covid’, ‘fiqa’, ‘dbpedia_entity’, ‘trec_news’, ‘mr_tydi’]

  • web_search: This key is likely used for general web search tasks where the goal is to retrieve the most relevant documents from the web based on a user’s query.
  • sci_fact: This could be related to scientific fact verification, where the system retrieves documents that can confirm or refute a scientific claim.
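Conceptually, a HyDE-style embedder can be sketched as below. This is a simplified illustration and not LangChain's actual source: `ToyHydeEmbedder`, `fake_generate` and `fake_embed` are hypothetical names. The sketch assumes that documents are embedded directly with the base model, while a query is first turned into one or more generated documents whose embeddings are averaged.

```python
from statistics import fmean

class ToyHydeEmbedder:
    """Simplified illustration of a HyDE-style embedder (not LangChain's source)."""

    def __init__(self, generate, base_embed, n=1):
        self.generate = generate      # callable: query -> hypothetical document text
        self.base_embed = base_embed  # callable: text -> list of floats
        self.n = n                    # hypothetical documents generated per query

    def embed_documents(self, texts):
        # Real documents are embedded directly; no LLM call is involved.
        return [self.base_embed(t) for t in texts]

    def embed_query(self, query):
        # The query is replaced by generated documents, whose embeddings are averaged.
        vectors = [self.base_embed(self.generate(query)) for _ in range(self.n)]
        return [fmean(dim) for dim in zip(*vectors)]

# Hypothetical stand-ins so the sketch runs without an LLM or an embedding model
def fake_generate(query):
    return "Jobs left Apple in 1985."

def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]

embedder = ToyHydeEmbedder(fake_generate, fake_embed)
print(embedder.embed_query("When was Steve Jobs fired from Apple?"))
print(embedder.embed_documents(["Jobs left Apple in 1985."]))
```

Under this assumption, building the vector store costs no extra LLM calls; the LLM is only invoked once per query at search time.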

Creating the database with Hypothetical Document Embedding function

doc_db = Chroma.from_documents(docs, hyde_embedding_function,persist_directory='output/steve_job_hyde_chains')

Getting the matched documents:

matched_docs_new = doc_db.similarity_search(query)

for doc in matched_docs_new:
    print(doc.page_content)
    print(' ')


# OUTPUT

In 1997, Jobs returned to Apple as CEO after the company's acquisition of NeXT. He was largely responsible for reviving Apple, which was on the verge of bankruptcy. He worked closely with British designer Jony Ive to develop a line of products and services that had larger cultural ramifications,

In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets,

On October 5, 2011, at the age of 56, Steve Jobs, the CEO of Apple, died due to complications from a relapse of islet cell neuroendocrine pancreatic cancer. Powell Jobs inherited the Steven P. Jobs Trust, which as of May 2013 had a 7.3% stake in The Walt Disney Company worth about $12.1 billion,

conducted by Sorkin. The film covers fourteen years in the life of Apple Inc. co-founder Steve Jobs, specifically ahead of three press conferences he gave during that time - the formal unveiling of the Macintosh 128K on January 24, 1984; the unveiling of the NeXT Computer on October 12, 1988; and

Conclusion

Hypothetical Document Embeddings (HyDE) can address some limitations of a basic RAG retriever by improving retrieval accuracy and reducing hallucinations. In most cases it works better than other retrievers, but only when the LLM has some knowledge about the question being asked. If the LLM has no clue about the question, the results can be quite messy. So, look at the LLM's responses for a sample set of questions before proceeding with it.
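One way to run such a check is to compare retrievers on a few labelled questions. The sketch below is a minimal illustration: `hit_rate`, the sample pairs and both retriever functions are hypothetical, and the canned retriever outputs are placeholders rather than measurements. In practice the two functions would wrap the base retriever and the HyDE-backed store.

```python
def hit_rate(retrieve, samples):
    """Fraction of samples whose expected snippet appears in a retrieved chunk."""
    hits = 0
    for question, expected_snippet in samples:
        chunks = retrieve(question)
        if any(expected_snippet in chunk for chunk in chunks):
            hits += 1
    return hits / len(samples)

# Hypothetical labelled samples: (question, text a correct chunk should contain)
samples = [
    ("When was Steve Jobs fired from Apple?", "In 1985, Jobs departed Apple"),
    ("When did Jobs return to Apple?", "In 1997, Jobs returned to Apple"),
]

# Stand-in retrievers with canned outputs (placeholders, not real results)
def retriever_a(question):
    return ["In 1997, Jobs returned to Apple as CEO"]

def retriever_b(question):
    return ["In 1985, Jobs departed Apple after a long power struggle",
            "In 1997, Jobs returned to Apple as CEO"]

print("retriever_a:", hit_rate(retriever_a, samples))
print("retriever_b:", hit_rate(retriever_b, samples))
```

Printing the generated hypothetical documents alongside these scores makes it easier to spot questions the LLM knows nothing about.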

Connect with me on LinkedIn if you have any questions.

References

https://python.langchain.com/v0.1/docs/templates/hyde/
