RAG Q&A Chatbot using OpenAI, LangChain, ChromaDB and Gradio
Retrieval-Augmented Generation (RAG) is a promising approach to two key limitations of Large Language Models (LLMs): hallucinated information and inconsistent outputs. With RAG, we give the model access to specific information that it can use as context, so its responses are more factually correct and more consistent.
Let’s take a look at the RAG pipeline.
The process begins with a data source which can be text data such as PDFs that are loaded and stored as documents. During preprocessing, the documents are segmented or chunked into pieces that align with the model’s context window.
The segmented documents are fed into an embedding model, which generates vector representations of the text. These vectors are then saved in a vector database that can be used during the generation process.
Upon receiving a user query, the system employs the same embedding model to transform the query into a vector representation. This vector is then used to search the vector database for entries containing similar vector representations, indicating potentially relevant information.
The retrieval step in RAG identifies the relevant information in the vector database. We will come back to this step, as it is an essential component of RAG.
The retrieved information becomes the context that provides the LLM with additional knowledge relevant to the user’s query. Finally, the original query is augmented, or enriched, with this context before being fed to the LLM for response generation, which ensures that the output is factually more correct.
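To make the flow concrete, here is a minimal sketch of that loop in Python. All the names here (embed, vector_db, llm) are placeholders; the rest of this article builds the real versions of these pieces.

def rag_answer(query, vector_db, embed, llm):
    query_vector = embed(query)                     # embed the user query
    relevant_docs = vector_db.search(query_vector)  # retrieve similar chunks
    context = "\n\n".join(doc.text for doc in relevant_docs)
    augmented_prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(augmented_prompt)                    # generate a grounded answer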
What is Gradio?
Gradio is an open source Python library that simplifies the process of creating user interfaces for ML models, APIs, etc. In just a few lines of code, we can build a web interface that allows people to interact with the model.
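For example, a minimal Gradio app (unrelated to our chatbot, just to show the idea) looks like this:

import gradio as gr

def greet(name):
    return f"Hello, {name}!"

# Interface wires a Python function to input/output components and serves a web UI
gr.Interface(fn=greet, inputs="text", outputs="text").launch()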
Let’s get our hands dirty and start building a Q&A chatbot using RAG capabilities.
The first step is to import all necessary dependencies.
import gradio as gr
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.vectorstores import Chroma
We will discuss all of these dependencies as we move forward.
We need an OpenAI API key to call the ChatOpenAI model. You can get one from the OpenAI website.
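One common way to provide the key is through the OPENAI_API_KEY environment variable, which ChatOpenAI picks up automatically. The value below is a placeholder; use your own key, or set the variable in your shell or Space secrets instead of hard-coding it.

import os

# Placeholder key; prefer setting OPENAI_API_KEY outside the script
os.environ["OPENAI_API_KEY"] = "sk-..."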
Let’s create an instance of the chat model and define the embedding function we will use to generate and store vector representations in the vector store. The BGE models on HuggingFace are among the best open-source embedding models.
chat = ChatOpenAI()
embedding_function = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5", model_kwargs={"device": "cpu"}, encode_kwargs={"normalize_embeddings": True}
)
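As an optional sanity check, we can embed a sample sentence and inspect the vector size; bge-large-en-v1.5 should produce 1024-dimensional embeddings.

# Optional: verify the embedding model loads and check the vector dimension
sample_vector = embedding_function.embed_query("What is retrieval-augmented generation?")
print(len(sample_vector))  # expected: 1024 for bge-large-en-v1.5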
Now let’s define a function add_docs() that takes in a file (say, a PDF), loads it, and splits it using RecursiveCharacterTextSplitter with a chunk size of 500 and a chunk overlap of 100. It then embeds the resulting documents using the embedding function and stores them in a Chroma vector database. We will use the same database during generation.
def add_docs(path):
    # Load the PDF and split it into overlapping chunks
    loader = PyPDFLoader(file_path=path)
    docs = loader.load_and_split(
        text_splitter=RecursiveCharacterTextSplitter(
            chunk_size=500, chunk_overlap=100, length_function=len, is_separator_regex=False
        )
    )
    # Embed the chunks and persist them in a Chroma vector store
    db = Chroma.from_documents(
        documents=docs, embedding=embedding_function, persist_directory="output/general_knowledge"
    )
    return db
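To test this outside the UI, we can point the function at any local PDF (the path below is just an illustration) and run a quick similarity search against the resulting store:

# Hypothetical local file, used only for a quick test
db = add_docs("sample_docs/general_knowledge.pdf")

# Fetch the two chunks most similar to a test query
for doc in db.similarity_search("What is this document about?", k=2):
    print(doc.page_content[:200], "\n---")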
Retrieval Process
For the retrieval process, we will use a combination of contextual compression and a multi-query retriever. I have discussed these in my article.
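In short, the MultiQueryRetriever asks the LLM to rephrase the user’s question into several variants and retrieves documents for each of them, while contextual compression then uses the LLM to keep only the passages that are actually relevant. If you want to see the generated query variants while experimenting, the multi-query module logs them at INFO level:

import logging

# Print the alternative queries generated by MultiQueryRetriever
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)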
We will define a function that first loads the vector store. It takes the user’s query, builds the context from the retrieved documents, augments the prompt with that context, and generates a response. Since we are building a chatbot, we also append the response to the chat history.
We use prompting techniques in LangChain to create the prompt templates for generation.
def answer_query(message, chat_history):
    # Load the persisted vector store and build the retriever chain
    base_compressor = LLMChainExtractor.from_llm(chat)
    db = Chroma(persist_directory="output/general_knowledge", embedding_function=embedding_function)
    base_retriever = db.as_retriever()
    mq_retriever = MultiQueryRetriever.from_llm(retriever=base_retriever, llm=chat)
    compression_retriever = ContextualCompressionRetriever(base_compressor=base_compressor, base_retriever=mq_retriever)

    # Retrieve the documents relevant to the user's question and build the context
    matched_docs = compression_retriever.get_relevant_documents(query=message)
    context = ""
    for doc in matched_docs:
        context += doc.page_content
        context += "\n\n"

    template = """
Answer the following question using only the context given below in the triple backticks; do not use any other information to answer the question.
If you can't answer the given question with the given context, return an empty string ('').
Context: ```{context}```
----------------------------
Question: {query}
----------------------------
Answer: """

    # Build the prompt, call the chat model, and update the chat history
    human_message_prompt = HumanMessagePromptTemplate.from_template(template=template)
    chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
    prompt = chat_prompt.format_prompt(query=message, context=context)
    response = chat(messages=prompt.to_messages()).content
    chat_history.append((message, response))
    return "", chat_history
With all of that done, let’s build the UI: a file-upload component for the PDF, a chatbot component, and a textbox where the user enters a question. Submitting the textbox calls answer_query(), which returns the response and adds it to the chat history.
with gr.Blocks() as demo:
    gr.HTML("<h1 align = 'center'>Smart Assistant</h1>")

    with gr.Row():
        upload_files = gr.File(label="Upload a PDF", file_types=[".pdf"], file_count="single")

    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Enter your question here")

    # Index the uploaded PDF, then answer questions against the stored chunks
    upload_files.upload(add_docs, upload_files)
    msg.submit(answer_query, [msg, chatbot], [msg, chatbot])

demo.launch()
After testing this locally, we can deploy it to Spaces by HuggingFace. We first create a Space and then upload two files, app.py and requirements.txt, in the Files section of the Space. If the build succeeds, our interface is ready.
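The requirements.txt should list everything the Space needs to build the app. Based on the imports used here, something along these lines should work (pin versions as you see fit):

gradio
langchain
langchain-community
openai
chromadb
pypdf
sentence-transformers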
This is what the interface looks like.
You can try out the app here:
I have added all the scripts here.