Document-Based Question Answering System with Semantic Search, Pinecone and OpenAI
The project combines several components: speech-to-text conversion, document processing and indexing, retrieval-based question answering, and text-to-speech synthesis.
Introduction:
Our project brings together speech recognition, natural language processing (NLP), and document retrieval into a unified question-answering system. Advanced speech recognition lets users interact with the system through spoken language, which improves accessibility and the overall user experience, and makes the interaction between human input and computational processing feel more natural and efficient.
Tools and Libraries Used for the Project:
- langchain: A framework for natural language processing tasks, including question answering and document processing.
- langchain_community: A community-contributed module providing additional functionality for the langchain framework.
- langchain_openai: A library interfacing with OpenAI services, particularly for embeddings and language models.
- langchain_pinecone: A library facilitating integration with Pinecone, a vector search engine for similarity search.
- Pinecone: The Python client for Pinecone, a managed vector database used for efficient document retrieval based on embeddings.
- gtts (Google Text-to-Speech): A Python library for text-to-speech conversion using Google’s Text-to-Speech API.
- pyttsx3: A cross-platform text-to-speech library for Python, providing offline text-to-speech synthesis capabilities.
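These dependencies can typically be installed with pip (exact package names may vary by version; UnstructuredPDFLoader additionally needs the unstructured package, and microphone input requires PyAudio):
pip install langchain langchain-community langchain-openai langchain-pinecone pinecone gTTS pyttsx3 SpeechRecognition unstructured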
Let’s break down the provided code into sections and explain each part:
1. Importing Required Libraries:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone
from langchain_openai import OpenAI
from langchain.chains.question_answering import load_qa_chain
import speech_recognition as sr
from gtts import gTTS
import pyttsx3
import time
import os
- This section imports the libraries and modules the project needs: text splitting (RecursiveCharacterTextSplitter from langchain), document loading (UnstructuredPDFLoader from langchain_community), embeddings (OpenAIEmbeddings), the vector store (PineconeVectorStore), the Pinecone client for vector indexing and search, the OpenAI language model, the question-answering chain loader (load_qa_chain), speech recognition (speech_recognition), text-to-speech conversion (gtts, pyttsx3), timing (time), and operating-system interaction (os).
2. Setting API Keys and Environment Variables:
OPENAI_API_KEY=""
PINECONE_API_KEY=""
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY
- Here, empty strings are assigned to the API keys for OpenAI and Pinecone. These keys need to be replaced with actual API keys obtained from the respective services. Additionally, the Pinecone API key is set as an environment variable.
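Hardcoding keys in source files makes them easy to leak. A safer pattern (a minimal sketch, assuming the keys were exported in the shell beforehand) is to read them from the environment:
import os

# Read keys from the environment instead of hardcoding them.
# Assumes e.g. `export OPENAI_API_KEY=...` was run beforehand.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY", "")
if not OPENAI_API_KEY or not PINECONE_API_KEY:
    raise RuntimeError("Set OPENAI_API_KEY and PINECONE_API_KEY before running.")
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY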
3. Definition of SpeaktoText Class:
class SpeaktoText:
    def SpeakText(self, command):
        # Initialize the offline text-to-speech engine
        engine = pyttsx3.init()
        engine.say(command)
        engine.runAndWait()

    def getTextFromSpeech(self):
        # Initialize the recognizer
        r = sr.Recognizer()
        while True:
            try:
                with sr.Microphone() as source:
                    print("Listening...")
                    r.adjust_for_ambient_noise(source, duration=0.2)
                    audio = r.listen(source)
                    print("Recognizing...")
                    text = r.recognize_google(audio)
                    text = text.lower()
                    return text
            except sr.UnknownValueError:
                print("Sorry, I did not understand that.")
            except sr.RequestError as e:
                print("Could not request results; {0}".format(e))

    def getSpeechFromText(self, textspeech: str):
        language = 'en'
        myobj = gTTS(text=textspeech, lang=language, slow=False)
        myobj.save("farmerdata.mp3")
        time.sleep(5)
        return os.system("mpg321 farmerdata.mp3")
- This class contains methods for converting text to speech offline (SpeakText), getting text from speech using speech recognition (getTextFromSpeech), and converting text to speech with Google Text-to-Speech and playing the result (getSpeechFromText).
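A quick usage sketch (assuming a working microphone and the mpg321 player installed):
speaker = SpeaktoText()
speaker.SpeakText("Hello, ask me a question.")   # offline playback via pyttsx3
heard = speaker.getTextFromSpeech()              # blocks until speech is recognized
print("You said:", heard)
speaker.getSpeechFromText("You said: " + heard)  # gTTS file + mpg321 playback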
4. Loading and Preprocessing Documents:
file = "data/farmerbook.pdf"
loader = UnstructuredPDFLoader(file)
documents = loader.load()
- This section loads a PDF document using the UnstructuredPDFLoader from the langchain_community module and preprocesses it into a list of documents.
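To sanity-check the load, it helps to inspect what came back (a minimal sketch; the exact metadata fields depend on the loader):
print(f"Loaded {len(documents)} document(s)")
print(documents[0].page_content[:200])  # first 200 characters of the extracted text
print(documents[0].metadata)            # e.g. the source file path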
5. Splitting Documents:
spliter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
data = spliter.split_documents(documents)
- The loaded documents are split into smaller chunks of text using the RecursiveCharacterTextSplitter from the langchain module; smaller chunks keep each embedding focused and ensure retrieved passages fit comfortably within the model's context window.
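A quick check on what the splitter produced (a sketch; the counts depend on the PDF):
lengths = [len(d.page_content) for d in data]
print(f"{len(data)} chunks, sizes from {min(lengths)} to {max(lengths)} characters")  # each <= 1000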
6. Initializing OpenAI Embeddings and Pinecone Vector Store:
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
index_name = "langchain3"
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(index_name)
index.delete(delete_all=True)
docsearch = PineconeVectorStore.from_documents(data, embeddings, index_name=index_name)
time.sleep(5)
- OpenAI embeddings are initialized with the provided API key, and a connection is opened to an existing Pinecone index with the specified name. Any vectors already in the index are deleted (delete_all=True), then embeddings for the document chunks are generated and upserted into the index for efficient similarity search; the short sleep gives Pinecone a moment to finish indexing before queries are issued.
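Note that pc.Index(index_name) assumes the index already exists. If it does not, it can be created first (a sketch for Pinecone's serverless tier; the dimension must match the embedding model, e.g. 1536 for OpenAI's text-embedding-ada-002, and the cloud/region values are assumptions):
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,   # must match the embedding dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )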
7. Instantiating SpeaktoText and OpenAI Objects, and Loading Question Answering Chain:
speaktotext = SpeaktoText()
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
- Instances of the SpeaktoText class and the OpenAI language model are created, and a question-answering chain is loaded with chain_type="stuff", which simply "stuffs" all retrieved documents into a single prompt alongside the question.
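For intuition, the chain can be exercised directly on a hand-made document (a sketch; the Document import path can vary across langchain versions):
from langchain_core.documents import Document

sample = [Document(page_content="Wheat is typically sown in autumn.")]
result = chain.invoke({"question": "When is wheat sown?", "input_documents": sample})
print(result["output_text"])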
8. Performing Speech Recognition, Document Retrieval, and Question Answering:
query = speaktotext.getTextFromSpeech()
docs = docsearch.similarity_search(query)
outputText = chain.invoke({"question": query, "input_documents": docs})['output_text']
- The getTextFromSpeech method obtains a query from the user via speech input. The query is then used to retrieve the most similar document chunks from the Pinecone index, and the question-answering chain is invoked with the query and the retrieved documents to produce the answer text.
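The retrieval step can also be inspected and tuned (a sketch; langchain's vector stores return 4 results by default):
docs_and_scores = docsearch.similarity_search_with_score(query, k=4)
for doc, score in docs_and_scores:
    print(f"score={score:.3f}  {doc.page_content[:80]}")
docs = [doc for doc, _ in docs_and_scores]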
9. Converting Output Text to Speech:
speaktotext.getSpeechFromText(outputText)
- The answer text from the question-answering step is converted to speech using the getSpeechFromText method and played back to the user.
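getSpeechFromText shells out to mpg321, which may not be installed on every system. A simple fallback (a sketch reusing the class's offline pyttsx3 path) is:
# os.system returns a non-zero status if mpg321 playback failed
if speaktotext.getSpeechFromText(outputText) != 0:
    speaktotext.SpeakText(outputText)  # offline fallback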
Final Code
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone
from langchain_openai import OpenAI
from langchain.chains.question_answering import load_qa_chain
import speech_recognition as sr
from gtts import gTTS
import pyttsx3
import time
import os
OPENAI_API_KEY=""
PINECONE_API_KEY=""
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY
# Class wrapping text-to-speech and speech-to-text helpers
class SpeaktoText:
    def SpeakText(self, command):
        # Initialize the offline text-to-speech engine
        engine = pyttsx3.init()
        engine.say(command)
        engine.runAndWait()

    def getTextFromSpeech(self):
        # Initialize the recognizer
        r = sr.Recognizer()
        while True:
            try:
                with sr.Microphone() as source:
                    print("Listening...")
                    r.adjust_for_ambient_noise(source, duration=0.2)
                    audio = r.listen(source)
                    print("Recognizing...")
                    text = r.recognize_google(audio)
                    text = text.lower()
                    return text
            except sr.UnknownValueError:
                print("Sorry, I did not understand that.")
            except sr.RequestError as e:
                print("Could not request results; {0}".format(e))

    def getSpeechFromText(self, textspeech: str):
        language = 'en'
        myobj = gTTS(text=textspeech, lang=language, slow=False)
        myobj.save("farmerdata.mp3")
        time.sleep(5)
        return os.system("mpg321 farmerdata.mp3")
# load the PDF
file = "data/farmerbook.pdf"
loader = UnstructuredPDFLoader(file)
documents = loader.load()
print(f'You have {len(documents)} document(s) in your data')
print(f'There are {len(documents[0].page_content)} characters in your document')
# The PDF is split into small chunks of documents
spliter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
data = spliter.split_documents(documents)
print(f'Now you have {len(data)} documents')
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
index_name = "langchain3"
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(index_name)
index.delete(delete_all=True)
docsearch = PineconeVectorStore.from_documents(data, embeddings, index_name=index_name)
time.sleep(5)
# initialise the helper object, the LLM, and the QA chain
speaktotext = SpeaktoText()
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
query = speaktotext.getTextFromSpeech()
docs = docsearch.similarity_search(query)
outputText = chain.invoke({"question": query, "input_documents": docs})['output_text']
print(outputText)
speaktotext.getSpeechFromText(outputText)
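As written, the script answers a single spoken question and exits. Wrapping the last steps in a loop makes it conversational (a sketch; the "stop" exit keyword is an assumption):
while True:
    query = speaktotext.getTextFromSpeech()
    if query == "stop":  # hypothetical exit keyword
        break
    docs = docsearch.similarity_search(query)
    outputText = chain.invoke({"question": query, "input_documents": docs})['output_text']
    print(outputText)
    speaktotext.getSpeechFromText(outputText)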
Pinecone Server: [screenshot of the index in the Pinecone console]
Conclusion
The project showcases the seamless integration of state-of-the-art technologies in natural language processing (NLP), speech recognition, document retrieval, and question answering. By leveraging these technologies, we have developed a robust system that enables users to interact with vast repositories of textual data using natural language input.
Through the use of speech recognition, users can effortlessly communicate their queries to the system, eliminating the need for manual input. The system then processes these spoken queries using advanced NLP techniques to understand user intent and extract relevant information.
Furthermore, our system incorporates sophisticated document retrieval mechanisms, allowing for efficient indexing and retrieval of relevant documents from large corpora. This ensures that users are presented with accurate and pertinent information in response to their queries.
The integration of question-answering models powered by advanced AI technologies further enhances the system’s capabilities. These models analyze both the user query and retrieved documents to generate precise and contextually relevant answers, thereby providing users with valuable insights and information.