Leveraging RAG for Chatbot Development using Gemma-2b-it, FAISS VectorDB and Streamlit

Jayant Pal
6 min read · Apr 5, 2024


Large Language Models (LLMs) have significantly impacted the AI landscape in several ways, such as enhanced natural language processing, revolutionized content generation, and the democratization of AI, among others.

While LLMs offer impressive capabilities, they’re not without limitations. Their knowledge is limited to the data available at their most recent training run, which means they can miss recent developments. Additionally, for tasks requiring specific domain expertise, LLMs may resort to “hallucination” — essentially making up information that sounds plausible but isn’t factually accurate.

While retraining the LLM could address these problems, the immense cost and time involved make it impractical. We need a more efficient solution. Let’s talk about the solution, then!

RAG (Retrieval-Augmented Generation)

RAG emerges as a promising approach to the issues mentioned above. Using RAG, we can give the model access to specific information that can be used as context to generate responses.

In this article, we will build an LLM chatbot using Google’s open-source model Gemma, along with LangChain, the FAISS vector database, and Streamlit.

Let’s first look into an overview of the RAG pipeline.

Image Source: https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/

The process begins with a knowledge base, such as a collection of PDFs, which is stored as documents. These documents are then preprocessed by chunking them into smaller pieces that fit within the model’s context length.

Then we transform the chunked documents into embedding vectors using an embedding model and store them in a vector database.

When the user submits a query, we first convert the query into an embedding vector using the same embedding model. The system then searches the vector database for similar vectors based on the query vector.

The retrieval step in RAG essentially identifies relevant information from the knowledge base.

This information becomes the context for the LLM. Finally, the query is augmented with the retrieved context and passed to the LLM to generate a response.
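To make the flow concrete, here is a minimal, runnable sketch of the augmentation step. The retrieve function and the chunk strings are placeholders standing in for the FAISS search we build later in this article; the real components follow below.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Placeholder for the vector-database search implemented with FAISS below
    return ["chunk about topic A", "chunk about topic B", "chunk about topic C"][:k]

def build_augmented_prompt(query: str, chunks: list[str]) -> str:
    # The retrieved chunks become the context that is prepended to the question
    context = "\n".join(chunks)
    return (
        "Using the information contained in the context, answer the question.\n"
        f"Context: {context}\nQuestion: {query}"
    )

print(build_augmented_prompt("What is RAG?", retrieve("What is RAG?")))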

With all that said, let’s start building the chatbot. I am using the Google Colab free-tier GPU for this entire pipeline.

We will use the Gemma-2B model as the generator LLM.

Gemma-2B

Gemma-2B is a variant of Gemma, a family of decoder-only text-to-text LLMs developed by Google, available in English with open weights, in both pre-trained and instruction-tuned variants. To use this model, we need to accept the terms and conditions on Hugging Face and log in with a Hugging Face token.

from huggingface_hub import notebook_login
notebook_login()

Let’s initialize the model. We also use 4-bit quantization (which requires a GPU) to reduce memory usage, using BitsAndBytesConfig from Transformers. We also cache the models on Google Drive so they don’t need to be re-downloaded on every run.

Note: We can also load the model on CPU without quantization by using disk offloading.
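Here is a rough sketch of that CPU path, assuming the accelerate package is installed and there is enough disk space for an offload folder ("offload" below is just an arbitrary local directory):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # accelerate decides CPU/disk placement
    offload_folder="offload",   # spill weights that don't fit in RAM to disk
    torch_dtype=torch.float32,
)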

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

CACHE_DIR = "/content/drive/LLM_RAG_Bot/models"


class ChatModel:
    def __init__(self, model_id: str = "google/gemma-2b-it", device="cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=CACHE_DIR)

        # 4-bit quantization so the model fits on the free-tier GPU
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        )

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            quantization_config=quantization_config,
            cache_dir=CACHE_DIR,
        )
        self.model.eval()
        self.chat = []
        self.device = device

Now, let us define the main inference function.

    def inference(self, question: str, context: str = None, max_new_tokens: int = 250):
        # Build the prompt depending on whether retrieved context is available
        if context is None or context == "":
            prompt = f"""Give a detailed answer to the question. Question: {question}"""
        else:
            prompt = f"""Using the information contained in the context, give a detailed answer to the question. Do not add any extra information. Context: {context}. Question: {question}"""

        chat = [{"role": "user", "content": prompt}]
        formatted_prompt = self.tokenizer.apply_chat_template(
            chat, tokenize=False, add_generation_prompt=True
        )

        inputs = self.tokenizer.encode(
            formatted_prompt, add_special_tokens=False, return_tensors="pt"
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model.generate(
                input_ids=inputs, max_new_tokens=max_new_tokens, do_sample=False
            )

        # Decode, strip the prompt from the output and remove the end-of-sequence token
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=False)
        response = response[len(formatted_prompt):]
        response = response.replace("<eos>", "")

        return response

The inference function takes a question and an optional context and generates a response. We build the prompt based on whether context is provided, encode it, and feed it to the LLM along with the generation parameters. Finally, we decode the output tokens back into text.
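For illustration, the class could be used on its own like this (the example question and context string are made up):

chat_model = ChatModel(model_id="google/gemma-2b-it", device="cuda")

# Without context, the model answers from its pretraining knowledge
print(chat_model.inference("What is retrieval augmented generation?"))

# With context, the answer is grounded in the text we pass in
manual_excerpt = "The freezer compartment should be kept at -18 degrees Celsius."  # made-up context
print(chat_model.inference("What temperature should the freezer be kept at?", context=manual_excerpt))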

Embedding model

Now, our model for inference is loaded. As mentioned before, we need an embedding model to transform the documents (the context) as well as the query into vectors. For this we use the sentence-transformers encoder model “all-MiniLM-L12-v2”, which encodes text into a 384-dimensional vector.

You can use other embedding models such as OpenAIEmbeddings, BGEEmbeddings, etc. Let’s initialize the model.

from langchain_community.embeddings import HuggingFaceEmbeddings

CACHE_DIR = "/content/drive/LLM_RAG_Bot/models"


class Encoder:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L12-v2", device="cpu"):
        self.embedding_function = HuggingFaceEmbeddings(
            model_name=model_name,
            cache_folder=CACHE_DIR,
            model_kwargs={"device": device},
        )
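As a quick sanity check (a usage sketch, not part of the final app), we can embed a sample query and confirm that the vector has 384 dimensions:

encoder = Encoder(device="cpu")

# embed_query returns a plain Python list of floats
vector = encoder.embedding_function.embed_query("How do I defrost the freezer?")
print(len(vector))  # 384 for all-MiniLM-L12-v2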

With the inference LLM and embedding model set, let’s look into the ingestion part.

Document Loaders and Splitters

As mentioned earlier, to provide the model some context around the query, we need a source of information. We will load a PDF document, which will act as the context.

I have briefly covered Document Loaders and Splitters in my earlier articles; I’d suggest giving them a read. Let’s define the loading and splitting functionality.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


def load_and_split_pdfs(file_paths: list, chunk_size: int = 256):
    loaders = [PyPDFLoader(file_path) for file_path in file_paths]
    pages = []
    for loader in loaders:
        pages.extend(loader.load())

    # Chunk by token count of the embedding model's tokenizer, with ~10% overlap
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L12-v2"),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        strip_whitespace=True,
    )

    docs = text_splitter.split_documents(pages)
    return docs
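Used on its own, the function would look like this (the PDF path is just an illustrative placeholder):

docs = load_and_split_pdfs(["/content/drive/LLM_RAG_Bot/files/user_manual.pdf"], chunk_size=256)
print(len(docs), "chunks")
print(docs[0].page_content[:200])  # each chunk is a LangChain Document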

Vector Database

After loading and chunking the documents, we need to store them in a vector database. I have used FAISS from Meta AI, which is very efficient for similarity search and indexing.

from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy


class FaissDb:
    def __init__(self, docs, embedding_function):
        self.db = FAISS.from_documents(
            docs, embedding_function, distance_strategy=DistanceStrategy.COSINE
        )

    def similarity_search(self, question: str, k: int = 3):
        # Retrieve the k most similar chunks and join them into one context string
        retrieved_docs = self.db.similarity_search(question, k=k)
        context = "".join(doc.page_content + "\n" for doc in retrieved_docs)
        return context

As the similarity metric, I have used cosine similarity.
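Putting the pieces together outside of Streamlit, a minimal end-to-end retrieval-and-generation run might look like this; the PDF path and question are illustrative placeholders, and the classes are the ones defined above:

encoder = Encoder(device="cpu")
chat_model = ChatModel(device="cuda")

docs = load_and_split_pdfs(["/content/drive/LLM_RAG_Bot/files/user_manual.pdf"])
db = FaissDb(docs=docs, embedding_function=encoder.embedding_function)

question = "How often should the water filter be replaced?"  # example question
context = db.similarity_search(question, k=3)
print(chat_model.inference(question, context=context))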

Streamlit UI

I have used Streamlit to create a user interface for the chatbot. We need to create an app.py file that contains all the required classes and functionality.

First, we load the LLM and the embedding model using the st.cache_resource decorator, so they are initialized only once. Then we define a helper function to save uploaded documents to disk.

import os
import streamlit as st
from model import ChatModel
import rag_util

FILES_DIR = "/content/drive/LLM_RAG_Bot/files"

st.title("Gemma 2B Chatbot")


@st.cache_resource
def load_model():
    model = ChatModel(model_id="google/gemma-2b-it", device="cuda")
    return model


@st.cache_resource
def load_encoder():
    encoder = rag_util.Encoder(model_name="sentence-transformers/all-MiniLM-L12-v2", device="cpu")
    return encoder


model = load_model()
encoder = load_encoder()


def save_file(uploaded_file):
    """helper function to save documents to disk"""
    file_path = os.path.join(FILES_DIR, uploaded_file.name)
    with open(file_path, "wb") as f:
        f.write(uploaded_file.getbuffer())
    return file_path

Next, we create a sidebar where the user can upload PDFs and set max_new_tokens and k (the number of retrieved chunks). After the documents are uploaded, we load and split them and store them in the FAISS DB.

with st.sidebar:
    max_new_tokens = st.number_input("max_new_tokens", 128, 4096, 512)
    k = st.number_input("k", 1, 10, 3)
    uploaded_files = st.file_uploader(
        "Upload PDFs for model context", type=["pdf", "PDF"], accept_multiple_files=True
    )

    file_paths = []
    for file in uploaded_files:
        file_paths.append(save_file(file))

    if uploaded_files != []:
        # create DB
        docs = rag_util.load_and_split_pdfs(file_paths)
        DB = rag_util.FaissDb(docs=docs, embedding_function=encoder.embedding_function)

We then take the user prompt and find the relevant context using similarity search. This information is passed to the model to generate a response to the query.

with st.chat_message("assistant"):
    user_prompt = st.session_state.messages[-1]["content"]

    # Retrieve context only if documents have been uploaded
    context = None if uploaded_files == [] else DB.similarity_search(user_prompt, k=k)

    answer = model.inference(user_prompt, context=context, max_new_tokens=max_new_tokens)

    st.write(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})

The chatbot we built in this article using Gemma and LangChain answers questions about a refrigerator user manual.

When asked a specific question, this is how the LLM responded using the retrieved context.

So, this is how we can create a custom chatbot that reads PDF documents and answers domain-specific questions. The entire code is available on my GitHub.

Let me know if you have any questions. Keep learning!

