RAG using LangChain : Part 2- Text Splitters and Embeddings
The next step in the retrieval process in RAG is to transform and embed the loaded documents. If you are not familiar with loading raw text as documents using Document Loaders, I would encourage you to check out the first article.
How to do Transformation? — Text Splitters
Once we have loaded the documents, we may want to split them into smaller chunks (or parts) that can fit into our model’s context window. This is where LangChain’s text splitters come into the picture!
But this is not as easy as it sounds; there can be a lot of complexity involved. Ideally, we want to keep semantically related pieces of text together, i.e., pieces that share contextual meaning.
Let’s hop onto the different types of text splitters in LangChain.
Character Text Splitter
This is one of the simplest methods. It splits the text on a single separator character and only breaks a chunk when it exceeds the given chunk size.
#Let's read a text file
filepath = "../../Sessions_Part2\datasets\Harry Potter 1 - Sorcerer_s Stone.txt"
with open(filepath, 'r') as f:
    hp_book = f.read()

from langchain.text_splitter import CharacterTextSplitter

def len_func(text):
    return len(text)

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1200,
    chunk_overlap=100,
    length_function=len_func,
    is_separator_regex=False)
para_list = text_splitter.create_documents(texts = [hp_book])
para_list[:2]
[Document(page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the\nneighbors. The Dursleys had a small son called Dudley and in their\nopinion there was no finer boy anywhere."),
Document(page_content="The Dursleys had everything they wanted, but they also had a secret, and\ntheir greatest fear was that somebody would discover it. They didn't\nthink they could bear it if anyone found out about the Potters. Mrs.\nPotter was Mrs. Dursley's sister, but they hadn't met for several years;\nin fact, Mrs. Dursley pretended she didn't have a sister, because her\nsister and her good-for-nothing husband were as unDursleyish as it was\npossible to be. The Dursleys shuddered to think what the neighbors would\nsay if the Potters arrived in the street. The Dursleys knew that the\nPotters had a small son, too, but they had never even seen him. This boy\nwas another good reason for keeping the Potters away; they didn't want\nDudley mixing with a child like that.\n\nWhen Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story\nstarts, there was nothing about the cloudy sky outside to suggest that\nstrange and mysterious things would soon be happening all over the\ncountry. Mr. Dursley hummed as he picked out his most boring tie for\nwork, and Mrs. Dursley gossiped away happily as she wrestled a screaming\nDudley into his high chair.\n\nNone of them noticed a large, tawny owl flutter past the window.")]
There is no metadata associated with the result, so we can add it ourselves.
first_chunk = para_list[0]
first_chunk.metadata = {"source":filepath}
first_chunk.metadata
{'source': '../../Sessions_Part2\\datasets\\Harry Potter 1 - Sorcerer_s Stone.txt'}
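To attach the same source to every chunk at once, a simple loop over the list works (a minimal sketch using the para_list created above):
# attach the same source metadata to every chunk
for chunk in para_list:
    chunk.metadata = {"source": filepath}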
One of the major issues with this text splitter is that it only ever splits on the single separator: if a piece of text exceeds the chunk size but contains no separator, it cannot be broken down further. This is where the next splitter comes in.
Recursive Character Text Splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " "],
    chunk_size=200,
    chunk_overlap=100,
    length_function=len_func,
    is_separator_regex=False
)
chunk_list = text_splitter.create_documents(texts=[hp_book])
Here, we first split at the paragraph level ("\n\n"); if a chunk still exceeds the chunk size, the splitter moves on to the next separator at the line level ("\n"); and if it still exceeds the limit, it falls back to the word level (" ").
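As a quick sanity check, we can confirm that no chunk in chunk_list exceeds the 200-character limit (a small sketch):
# inspect the resulting chunk sizes
chunk_lengths = [len(doc.page_content) for doc in chunk_list]
print(len(chunk_list), max(chunk_lengths))  # number of chunks, longest chunk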
Split by tokens
Language models always have a token limit that we cannot exceed, so when we split text into chunks, it is a good idea to count tokens rather than characters.
tiktoken is a fast BPE (Byte-Pair Encoding) tokenizer created by OpenAI; we can use it to count the number of tokens in a piece of text.
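To get a feel for it, we can count tokens with tiktoken directly (a small sketch; assumes a recent tiktoken release that recognizes this model name):
# count tokens for a piece of text with tiktoken
import tiktoken

encoding = tiktoken.encoding_for_model("text-embedding-3-small")  # maps to the cl100k_base encoding
num_tokens = len(encoding.encode("Mr. and Mrs. Dursley, of number four, Privet Drive"))
print(num_tokens)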
# Splitting based on the token limit
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator="\n\n",
    chunk_size=1200,
    chunk_overlap=100,
    is_separator_regex=False,
    model_name='text-embedding-3-small',  # used to pick the tiktoken encoding and count tokens
    encoding_name='cl100k_base'  # fallback encoding if model_name is not given
)
doc_list = text_splitter.create_documents([hp_book])
doc_list # returns list of document objects
We can also split the text and get back plain text chunks instead of Document objects.
line_list = text_splitter.split_text(hp_book)
To convert the split text back into a list of Document objects:
from langchain.docstore.document import Document
doc_list = []
for line in line_list:
    curr_doc = Document(page_content=line, metadata={"source": filepath})
    doc_list.append(curr_doc)
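Alternatively, create_documents accepts a list of metadata dictionaries (one per input text), so the manual loop above can be avoided:
# attach metadata while splitting, instead of looping afterwards
doc_list = text_splitter.create_documents([hp_book], metadatas=[{"source": filepath}])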
Splitting code
We can split code written in any programming language. Here is an example that splits Python code using RecursiveCharacterTextSplitter with Python-specific separators.
python_code = """def peer_review(article_id):
chat = ChatOpenAI()
loader = ArxivLoader(query=article_id, load_max_docs=2)
data = loader.load()
first_record = data[0]
page_content = first_record.page_content
title = first_record.metadata['Title']
summary = first_record.metadata['Summary']
''''''''''''
''''''''''''
return response.content"""
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,
    chunk_overlap=10
)
text_splitter.create_documents(texts = [python_code])
[Document(page_content='def peer_review(article_id):'),
Document(page_content='chat = ChatOpenAI()'),
Document(page_content='loader = ArxivLoader(query=article_id,'),
Document(page_content='load_max_docs=2)'),
.....................................................................)]
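If you are curious which separators from_language applies for a given language, RecursiveCharacterTextSplitter exposes them:
# inspect the Python-specific separators (class and function definitions first, then blank lines, lines, words)
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)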
Apart from all these text splitters, there are also splitters based on NLTK, spaCy, Sentence Transformers, etc.
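For example, SentenceTransformersTokenTextSplitter chunks text by the token count of a sentence-transformers model (a sketch; assumes the sentence-transformers package is installed and that its default model downloads on first use):
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

st_splitter = SentenceTransformersTokenTextSplitter(
    tokens_per_chunk=256,  # token budget per chunk
    chunk_overlap=50
)
st_chunks = st_splitter.split_text(hp_book)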
Text Embeddings
Embeddings are used to create a vector representation of the text. These are stored along with their corresponding text in the vector database. We use an embedding function to create embeddings of the documents.
The base Embeddings class in LangChain provides two methods: one for embedding documents (to be searched over) and one for embedding a query (the search query). The former takes multiple texts as input, while the latter takes a single text.
OpenAI Embeddings
import os
from langchain.embeddings import OpenAIEmbeddings
# setting up OPENAI API key as environment variable
with open("../../openai_api_key.txt") as f:
api_key = f.read()
os.environ['OPENAI_API_KEY'] = api_key
# creating an instance of OpenAIEmbeddings model
embeddings = OpenAIEmbeddings()
text = "The scar had not pained Harry for nineteen years. All was well."
embedded_text = embeddings.embed_query(text)
print(embedded_text[:5])
print(len(embedded_text))
#[-0.006067691294778975, -0.006654051049083575, 0.03343223365953213, -0.02039625470103048, -0.008338620781671906]
#1536
What if we have multiple lines (or documents) to embed?
import numpy as np
from langchain.docstore.document import Document

# convert into Document objects
doc_lines = [
    Document(page_content=text, metadata={"source": "Harry Potter"}),
    Document(page_content="It is our choices, Harry, that show what we truly are, far more than our abilities", metadata={"source": "Harry Potter"})
]
# extract the page content
line_list = [doc.page_content for doc in doc_lines]
# embed each text individually
embedded_docs = [embeddings.embed_query(text) for text in line_list]
np.array(embedded_docs).shape
# (2, 1536)
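Instead of calling embed_query in a loop, the embed_documents method mentioned earlier embeds a whole batch of texts in one call:
# batch embedding with embed_documents
embedded_docs = embeddings.embed_documents(line_list)
np.array(embedded_docs).shape
# (2, 1536)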
Now, OpenAI embeddings are expensive. Let’s explore some of the best-performing open-source embedding models.
BGE (BAAI General Embeddings) Model
BGE models, released on HuggingFace by BAAI (Beijing Academy of Artificial Intelligence), are among the best open-source embedding models.
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {"device": 'cpu'}
encode_kwargs = {'normalize_embeddings': True}

hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
import numpy as np

embedded_docs = [hf.embed_query(text) for text in line_list]
np.array(embedded_docs).shape
# (2, 768)
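Because we set normalize_embeddings=True, a simple dot product between vectors gives their cosine similarity, which is what retrieval ultimately relies on (an illustrative sketch with a made-up query):
# cosine similarity between a query and the two embedded lines (vectors are unit length)
query_vec = np.array(hf.embed_query("What matters more, our choices or our abilities?"))
doc_vecs = np.array(embedded_docs)
print(doc_vecs @ query_vec)  # higher score = more similar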
Fake Embeddings
If we have hardware constraints, or simply want to test a pipeline without computing real embeddings, we can use Fake Embeddings from LangChain, which generate random vectors of a chosen size.
from langchain_community.embeddings import FakeEmbeddings
fake_embeddings = FakeEmbeddings(size = 300) # embedding size
fake_embedding_record = fake_embeddings.embed_query("This is a random text")
fake_embedding_records = fake_embeddings.embed_documents(["This is a random text"])
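The returned vectors are just random numbers of the requested size, which is enough to wire up and test the rest of the pipeline:
# fake vectors have the requested dimensionality but carry no meaning
print(len(fake_embedding_record))       # 300
print(len(fake_embedding_records[0]))   # 300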
So, this is all I had regarding Text Splitters and Text Embeddings in the retrieval step. There is a lot more information available in the official LangChain documentation.
I have added all the code to my GitHub repo. Let me know if you have questions. I will see you in the next one.
Keep Learning!
References