RAG using LangChain : Part 4-Retrievers
In the previous article, we touched upon Vector Stores and Retrievers. In this article, we will briefly discuss the different types of retrievers in LangChain.
A retriever is an interface that returns documents given an unstructured query. Unlike a Vector Store, it does not have to store documents; it only has to return them. Retrievers accept a string query as input and return a list of Documents as output. I talked about the Vector-Store retriever and the BM-25 Retriever in the previous article. Let’s explore a few other retrievers.
Semantic Retrievers
Semantic Retrievers focus on understanding the underlying context of a query and of the documents in order to retrieve the relevant information from the database. They leverage word embeddings and sentence encoders to capture the semantic meaning of the text. Let’s look into a few of these.
1. Multi Query Retriever
MultiQueryRetriever automates the process of prompt tuning. As the name suggests, it uses an LLM to generate multiple queries from a given user input query. For each of these queries, it retrieves a set of relevant documents and then takes the unique union across all queries to get a larger set of potentially relevant documents. Let’s look into its functionality.
Import all necessary dependencies.
import chromadb
from dotenv import load_dotenv
from langchain.chat_models import ChatOpenAI
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma
Let’s load documents using WikipediaLoader, split them into chunks, embed the chunks using an embedding function, and store them in a Chroma DB.
chunk_size = 400
chunk_overlap = 100
# loading environment variables
#load_dotenv()
import os
# read the OpenAI API key from a local file and expose it as an environment variable
with open('../../openai_api_key.txt') as f:
    api_key = f.read().strip()
os.environ['OPENAI_API_KEY'] = api_key
# loading chat model
chat = ChatOpenAI()
# loading data
loader = WikipediaLoader(query="Steve Jobs", load_max_docs=5)
documents = loader.load()
# text splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap = chunk_overlap)
docs = text_splitter.split_documents(documents=documents)
# embedding function
embedding_function = HuggingFaceBgeEmbeddings(
    model_name = "BAAI/bge-large-en-v1.5",
    model_kwargs = {'device':'cpu'},
    encode_kwargs = {'normalize_embeddings':True}
)
# vector store
db = Chroma.from_documents(docs, embedding_function, persist_directory="output/steve_jobs.db")
Next, we create an instance of the MultiQueryRetriever. We need to pass the retriever built on the vector database and the LLM that we are using for query generation.
from langchain.retrievers.multi_query import MultiQueryRetriever
mq_retriever = MultiQueryRetriever.from_llm(retriever = db.as_retriever(), llm = chat)
query = "When was Steve Jobs fired from Apple?"
retrieved_docs = mq_retriever.get_relevant_documents(query=query)
retrieved_docs
# [Document(page_content='On October 5, 2011, at the age of 56, Steve Jobs, the CEO of Apple, died due to complications from a relapse of islet cell neuroendocrine pancreatic cancer. Powell Jobs inherited the Steven P. Jobs Trust, which as of May 2013 had a 7.3% stake in The Walt Disney Company worth about $12.1 billion, and 38.5 million shares of Apple Inc.As of July 2020, Powell Jobs and her family were ranked 59th in', metadata={'source': 'https://en.wikipedia.org/wiki/Laurene_Powell_Jobs', 'summary': 'Laurene Powell Jobs (née Powell; born November 6, 1963) is an American billionaire businesswoman and executive. She is the widow of Steve Jobs, co-founder and former CEO of Apple Inc., and she manages the Steve Jobs Trust. She is the founder and chair of Emerson Collective and XQ Institute. She is a major donor to Democratic Party politicians.', 'title': 'Laurene Powell Jobs'}),
# Document(page_content="Apple CEO John Sculley demands to know why the world believes he fired Jobs – Jobs was actually forced out by the Apple board, who were resolute on updating the Apple II following the Macintosh's lackluster sales. Despite Sculley's warnings, Jobs criticized the decision and dared them to cast a final vote on his tenure. After Hoffman and Jobs discuss NeXT's unclear direction, she realizes Jobs", metadata={'source': 'https://en.wikipedia.org/wiki/Steve_Jobs_(film)', 'summary': "Steve Jobs is a 2015 biographical drama film directed by Danny Boyle and written by Aaron Sorkin. A British-American co-production, it was adapted from the 2011 biography by Walter Isaacson and interviews conducted by Sorkin. The film covers fourteen years in the life of Apple Inc. co-founder Steve Jobs, specifically ahead of three press conferences he gave during that time - the formal unveiling of the Macintosh 128K on January 24, 1984; the unveiling of the NeXT Computer on October 12, 1988; and the unveiling of the iMac G3 on May 6, 1998. Jobs is portrayed by Michael Fassbender, with Kate Winslet as Joanna Hoffman and Seth Rogen, Katherine Waterston, Michael Stuhlbarg, and Jeff Daniels in supporting roles.\nDevelopment began in 2011 after the rights to Isaacson's book were acquired. Filming began in January 2015. A variety of actors were considered and cast before Fassbender eventually took the role. Editing was extensive on the project, with editor Elliot Graham starting while the film was still shooting. Daniel Pemberton served as composer, with a focus on dividing the score into three distinguishable sections.\nSteve Jobs premiered at the 2015 Telluride Film Festival on September 5, 2015, and began a limited release in New York City and Los Angeles on October 9, 2015. It opened nationwide in the U.S. 
on October 23, 2015, to widespread critical acclaim, with Boyle's direction, visual style, Sorkin's screenplay, musical score, cinematography, editing and the acting of Fassbender and Winslet garnering unanimous acclaim. However, it was a financial disappointment, grossing only $34 million worldwide against a budget of $30 million. People close to Jobs such as Steve Wozniak and John Sculley praised the performances, but the film also received criticism for historical inaccuracy. Steve Jobs was nominated for Best Actor (Fassbender) and Best Supporting Actress (Winslet) at the 88th Academy Awards, and received numerous other accolades.", 'title': 'Steve Jobs (film)'}),
# Document(page_content="In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer", metadata={'source': 'https://en.wikipedia.org/wiki/Steve_Jobs', 'summary': 'Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology giant Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed College in 1972 before withdrawing that same year. In 1974, he traveled through India, seeking enlightenment before later studying Zen Buddhism. He and Wozniak co-founded Apple in 1976 to further develop and sell Wozniak\'s Apple I personal computer. Together, the duo gained fame and wealth a year later with production and sale of the Apple II, one of the first highly successful mass-produced microcomputers. Jobs saw the commercial potential of the Xerox Alto in 1979, which was mouse-driven and had a graphical user interface (GUI). This led to the development of the unsuccessful Apple Lisa in 1983, followed by the breakthrough Macintosh in 1984, the first mass-produced computer with a GUI. The Macintosh launched the desktop publishing industry in 1985 with the addition of the Apple LaserWriter, the first laser printer to feature vector graphics and PostScript.\nIn 1985, Jobs departed Apple after a long power struggle with the company\'s board and its then-CEO, John Sculley. 
That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer graphics division of Lucasfilm that eventually spun off independently as Pixar, which produced the first 3D computer-animated feature film Toy Story (1995) and became a leading animation studio, producing over 27 films since.\nIn 1997, Jobs returned to Apple as CEO after the company\'s acquisition of NeXT. He was largely responsible for reviving Apple, which was on the verge of bankruptcy. He worked closely with British designer Jony Ive to develop a line of products and services that had larger cultural ramifications, beginning with the "Think different" advertising campaign, and leading to the iMac, iTunes, Mac OS X, Apple Store, iPod, iTunes Store, iPhone, App Store, and iPad. In 2003, Jobs was diagnosed with a pancreatic neuroendocrine tumor. He died of respiratory arrest related to the tumor in 2011, and in 2022, was posthumously awarded the Presidential Medal of Freedom.', 'title': 'Steve Jobs'})]
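To make the "generate several queries, then take the union" step concrete, here is a minimal, library-free sketch. It is not LangChain's implementation: generate_queries is a hypothetical stand-in for the LLM call (it returns hard-coded rephrasings), keyword_retriever is a toy retriever over an in-memory list, and all names are chosen only for illustration.

```python
import re
from typing import Callable, List

# Toy corpus; in the article this role is played by the Chroma vector store.
CORPUS = [
    "In 1985, Jobs departed Apple after a power struggle with the board.",
    "Jobs founded NeXT in 1985 after leaving Apple.",
    "In 1997, Jobs returned to Apple as CEO.",
    "Pixar produced Toy Story in 1995.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retriever(query: str, k: int = 2) -> List[str]:
    """Hypothetical base retriever: rank documents by number of shared words."""
    q_words = tokenize(query)
    scored = [(len(q_words & tokenize(doc)), doc) for doc in CORPUS]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def generate_queries(query: str) -> List[str]:
    """Hypothetical stand-in for the LLM that rephrases the user query."""
    return [query, "Jobs departed Apple", "Jobs founded NeXT after leaving"]


def multi_query_retrieve(
    query: str,
    retriever: Callable[[str], List[str]],
    query_generator: Callable[[str], List[str]],
) -> List[str]:
    """Run the retriever once per generated query and return the unique union."""
    seen = set()
    merged = []
    for generated in query_generator(query):
        for doc in retriever(generated):
            if doc not in seen:  # de-duplicate while preserving first-seen order
                seen.add(doc)
                merged.append(doc)
    return merged


user_query = "When was Steve Jobs fired from Apple?"
merged_docs = multi_query_retrieve(user_query, keyword_retriever, generate_queries)
for doc in merged_docs:
    print(doc)
```

Because the original query is among the generated ones, the merged list always contains everything a single-query lookup would return, plus whatever the rephrasings surface.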
Contextual Compression
One major challenge with retrieval is that the information most relevant to the query may be buried in a document with a lot of irrelevant text. Passing the entire document to the LLM can also lead to more expensive LLM calls and poorer results. This is where Contextual Compression comes into the picture.
The main idea is to compress the documents based on the context of the query, so that only the relevant information is returned. To use a contextual compression retriever, we need a base retriever and a document compressor. Let’s look into a few document compressors.
LLMChainExtractor
It iterates over the initially returned documents and extracts from each only the content relevant to the query.
Let’s wrap the base retriever with a ContextualCompressionRetriever and use an LLMChainExtractor as its compressor. We can see that the result is much shorter and more specific to the query.
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
retriever = db.as_retriever()
chat = ChatOpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(chat)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
compressed_docs = compression_retriever.get_relevant_documents(query = query)
print(compressed_docs[0].page_content)
# In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley.
LLMChainFilter
It uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.
from langchain.retrievers.document_compressors import LLMChainFilter
compressor = LLMChainFilter.from_llm(chat)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
compressed_docs = compression_retriever.get_relevant_documents(query = query)
print(compressed_docs[0].page_content)
# In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer
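The difference between the two compressors above (one rewrites document contents, the other only keeps or drops whole documents) can be illustrated with a small, library-free sketch. The is_relevant function below is a hypothetical keyword-overlap stand-in for the LLM's judgment, and splitting sentences on periods is a deliberate simplification; neither reflects how LangChain implements these classes.

```python
import re
from typing import List

STOP_WORDS = {"the", "a", "was", "when", "from", "in", "of", "and", "to"}


def content_words(text: str) -> set:
    """Lowercased alphanumeric words with a few common stop words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def is_relevant(text: str, query: str) -> bool:
    """Hypothetical stand-in for the LLM: relevant if any content word is shared."""
    return bool(content_words(text) & content_words(query))


def extract(docs: List[str], query: str) -> List[str]:
    """Extractor-style compression: keep only the relevant sentences of each document."""
    compressed = []
    for doc in docs:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        kept = [s for s in sentences if is_relevant(s, query)]
        if kept:  # documents with no relevant sentence are dropped entirely
            compressed.append(". ".join(kept) + ".")
    return compressed


def filter_docs(docs: List[str], query: str) -> List[str]:
    """Filter-style compression: keep or drop whole documents without editing them."""
    return [doc for doc in docs if is_relevant(doc, query)]


sample_docs = [
    "Jobs was fired from Apple in 1985. He later funded Pixar.",
    "Toy Story was released in 1995.",
]
sample_query = "When was Jobs fired from Apple?"

extracted = extract(sample_docs, sample_query)
filtered = filter_docs(sample_docs, sample_query)
print(extracted)
print(filtered)
```

The extractor returns only the matching sentence, while the filter returns the first document unchanged, which mirrors the shorter and longer outputs shown for the two compressors.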
EmbeddingsFilter
Making an extra LLM call over each retrieved document can be slow and expensive. The EmbeddingsFilter provides a cheaper and faster option. It embeds the documents and the query and returns only those documents whose embeddings are sufficiently similar to the query embedding.
from langchain.retrievers.document_compressors import EmbeddingsFilter
# using similarity threshold of 0.6
embeddings_filter = EmbeddingsFilter(embeddings=embedding_function, similarity_threshold=0.6)
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)
compressed_docs = compression_retriever.get_relevant_documents(query = query)
print(compressed_docs[0].page_content)
# In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer
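The thresholding idea behind this filter can be sketched in plain Python. The two-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), the function names are hypothetical, and the strict "greater than" comparison is an assumption of this sketch rather than a statement about the library's exact behavior.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embeddings_filter(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    threshold: float,
) -> List[str]:
    """Keep only the documents whose similarity to the query exceeds the threshold."""
    return [
        text
        for text, vec in doc_vecs.items()
        if cosine_similarity(query_vec, vec) > threshold
    ]


# Made-up embeddings: the first document points in almost the same
# direction as the query, the second is nearly orthogonal to it.
toy_query_vec = [1.0, 0.0]
toy_doc_vecs = {
    "Jobs departed Apple in 1985.": [0.9, 0.1],
    "Toy Story was released in 1995.": [0.1, 0.9],
}

kept_docs = embeddings_filter(toy_query_vec, toy_doc_vecs, threshold=0.6)
print(kept_docs)
```

No LLM is involved at any point, which is why this approach is cheaper than the two LLM-based compressors; the trade-off is that the threshold has to be tuned for the embedding model in use.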
2. Parent Document Retriever
When splitting documents for retrieval, there are often conflicting requirements:
- You may want to have small documents, so that their embeddings can most accurately reflect their meaning. If too long, then the embeddings can lose meaning.
- You want to have long enough documents that the context of each chunk is retained.
The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks, looks up the parent IDs of those chunks, and then returns the larger parent documents.
Sometimes the full documents are too big to retrieve as is. In that case, we first split the raw documents into larger chunks and then split those into smaller chunks. We index the smaller chunks, but on retrieval we return the larger chunks.
from langchain.text_splitter import CharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
# This text splitter is used to create the parent documents
parent_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000, chunk_overlap=100)
# This text splitter is used to create the child documents
child_splitter = CharacterTextSplitter(separator="\n", chunk_size=200, chunk_overlap=50)
store = InMemoryStore() # parent documents
par_doc_retriever = ParentDocumentRetriever(vectorstore=db, docstore=store, child_splitter=child_splitter, parent_splitter=parent_splitter)
par_doc_retriever.add_documents(docs)
par_doc_retriever.get_relevant_documents(query=query)
# [Document(page_content="In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer", metadata={'title': 'Steve Jobs', 'summary': 'Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology giant Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed College in 1972 before withdrawing that same year. In 1974, he traveled through India, seeking enlightenment before later studying Zen Buddhism. He and Wozniak co-founded Apple in 1976 to further develop and sell Wozniak\'s Apple I personal computer. Together, the duo gained fame and wealth a year later with production and sale of the Apple II, one of the first highly successful mass-produced microcomputers. Jobs saw the commercial potential of the Xerox Alto in 1979, which was mouse-driven and had a graphical user interface (GUI). This led to the development of the unsuccessful Apple Lisa in 1983, followed by the breakthrough Macintosh in 1984, the first mass-produced computer with a GUI. The Macintosh launched the desktop publishing industry in 1985 with the addition of the Apple LaserWriter, the first laser printer to feature vector graphics and PostScript.\nIn 1985, Jobs departed Apple after a long power struggle with the company\'s board and its then-CEO, John Sculley. 
That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer graphics division of Lucasfilm that eventually spun off independently as Pixar, which produced the first 3D computer-animated feature film Toy Story (1995) and became a leading animation studio, producing over 27 films since.\nIn 1997, Jobs returned to Apple as CEO after the company\'s acquisition of NeXT. He was largely responsible for reviving Apple, which was on the verge of bankruptcy. He worked closely with British designer Jony Ive to develop a line of products and services that had larger cultural ramifications, beginning with the "Think different" advertising campaign, and leading to the iMac, iTunes, Mac OS X, Apple Store, iPod, iTunes Store, iPhone, App Store, and iPad. In 2003, Jobs was diagnosed with a pancreatic neuroendocrine tumor. He died of respiratory arrest related to the tumor in 2011, and in 2022, was posthumously awarded the Presidential Medal of Freedom.', 'source': 'https://en.wikipedia.org/wiki/Steve_Jobs'})]
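The "search small, return large" mechanism can be sketched without any library. In this simplified illustration, paragraphs play the role of parent chunks, sentences play the role of child chunks, and a case-insensitive substring match stands in for the vector search; the function names are hypothetical and the splitting rules are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_index(documents: List[str]) -> Tuple[Dict[int, str], List[Tuple[str, int]]]:
    """Split documents into parent paragraphs and child sentences.

    Returns the parent store (id -> paragraph) and a list of
    (child sentence, parent id) pairs that would be indexed for search.
    """
    parents: Dict[int, str] = {}
    children: List[Tuple[str, int]] = []
    for doc in documents:
        for paragraph in doc.split("\n\n"):
            parent_id = len(parents)
            parents[parent_id] = paragraph
            for sentence in paragraph.split(". "):
                children.append((sentence, parent_id))
    return parents, children


def retrieve_parents(
    query: str,
    parents: Dict[int, str],
    children: List[Tuple[str, int]],
) -> List[str]:
    """Match the query against child chunks, then return their unique parents."""
    hit_ids: List[int] = []
    for child, parent_id in children:
        # Substring search is a stand-in for similarity search over embeddings.
        if query.lower() in child.lower() and parent_id not in hit_ids:
            hit_ids.append(parent_id)
    return [parents[parent_id] for parent_id in hit_ids]


toy_document = (
    "Jobs co-founded Apple in 1976. The Apple II followed a year later.\n\n"
    "Jobs departed Apple in 1985. He founded NeXT the same year."
)

parent_store, child_index = build_index([toy_document])
parent_hits = retrieve_parents("NeXT", parent_store, child_index)
print(parent_hits)
```

The query matches only the short sentence about NeXT, yet the returned parent also carries the 1985 departure, which is the surrounding context a small chunk alone would have lost.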
3. Time Weighted Vector Store Retriever
This uses a combination of semantic similarity and time decay. The algorithm used is:
semantic_similarity + (1.0 - decay_rate) ^ hours_passed #hours passed = hours passed since object in the retriever was last accessed
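A quick arithmetic check of this formula shows how the decay rate controls the recency term. The sketch below only evaluates the expression as written above; the function names are chosen for illustration and are not part of LangChain.

```python
def recency_score(decay_rate: float, hours_passed: float) -> float:
    """Recency term of the formula: (1.0 - decay_rate) ^ hours_passed."""
    return (1.0 - decay_rate) ** hours_passed


def combined_score(semantic_similarity: float, decay_rate: float, hours_passed: float) -> float:
    """Semantic similarity plus the time-decayed recency term."""
    return semantic_similarity + recency_score(decay_rate, hours_passed)


# A document last accessed 24 hours ago, under three decay rates.
hours = 24
no_decay = recency_score(0.0, hours)      # 1.0: the document never fades
high_decay = recency_score(0.999, hours)  # vanishingly small after one day
full_decay = recency_score(1.0, hours)    # 0.0: recency contributes nothing

print(no_decay, high_decay, full_decay)

# With a high decay rate, a less similar but freshly accessed document
# can outrank a more similar document that was last accessed a day ago.
stale_but_similar = combined_score(0.9, 0.999, hours_passed=24)
fresh_but_weaker = combined_score(0.5, 0.999, hours_passed=0)
print(fresh_but_weaker > stale_but_similar)
```

At both extremes (a decay rate of 0 or of 1) the recency term is the same constant for every document that has aged, so the ranking falls back to semantic similarity alone.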
import faiss
from langchain.vectorstores import FAISS
from langchain.docstore import InMemoryDocstore
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_community.embeddings import FakeEmbeddings
from langchain_core.documents import Document
Low Decay Rate
A low decay rate (close to 0) means that memories will be remembered for longer. A decay rate of 0 means memories will never be forgotten, which makes this retriever equivalent to a plain vector lookup.
from datetime import datetime, timedelta
# define embedding model; FakeEmbeddings returns random vectors,
# and its size must match the dimension of the FAISS index
emb_size = 1024
embedding_function = FakeEmbeddings(size=emb_size)
# initialize empty vector store
index = faiss.IndexFlatL2(emb_size)
vector_store = FAISS(embedding_function, index, docstore=InMemoryDocstore({}), index_to_docstore_id = {})
tw_retriever = TimeWeightedVectorStoreRetriever(vectorstore = vector_store, decay_rate = 0.0000000000000000000000001, k=1)
yesterday = datetime.now() - timedelta(days=1)
# mark "hello world" as last accessed one day ago
tw_retriever.add_documents(
    [Document(page_content="hello world", metadata={"last_accessed_at": yesterday})]
)
tw_retriever.add_documents([Document(page_content="hello foo")])
# "hello world" is returned first because it is most salient, and the decay rate is close to 0, meaning it's still recent enough
tw_retriever.get_relevant_documents("hello world")
# [Document(page_content='hello world'),
# Document(page_content='hello foo')]
High Decay Rate
With a high decay rate (e.g., several 9’s), the recency score quickly goes to 0. If you set this all the way to 1, recency is 0 for all objects, once again making this equivalent to a vector lookup.
# the embedding size must match the dimension of the FAISS index
emb_size = 1024
embedding_function = FakeEmbeddings(size=emb_size)
index = faiss.IndexFlatL2(emb_size)
temp_db = FAISS(embedding_function, index, InMemoryDocstore({}),{})
tw_retriever = TimeWeightedVectorStoreRetriever(vectorstore = temp_db, decay_rate = 0.999, k=1)
yesterday = datetime.now() - timedelta(days=1)
# mark "hello world" as last accessed one day ago
tw_retriever.add_documents(
    [Document(page_content="hello world", metadata={"last_accessed_at": yesterday})]
)
tw_retriever.add_documents([Document(page_content="hello foo")])
# "hello foo" is returned first because "hello world" is mostly forgotten
tw_retriever.get_relevant_documents("hello world")
That’s all for this article. We covered some of the most important Retrievers in LangChain, which form an essential component of the RAG pipeline.
All the code has been added to this GitHub link. Let me know if you have any questions.