This application uses Streamlit to create an interactive chatbot capable of answering questions based on the content of PDF, DOCX, and PPTX files. It uses LangChain for conversation management and FAISS for vector search.
- 📄 Content extraction from PDF, DOCX, and PPTX files.
- 🤖 Intelligent chatbot powered by an LLM via LlamaCpp.
- 🔎 Contextual search through FAISS integration.
- 📂 Multi-document support to query multiple files simultaneously.
- 💾 Conversation download to save the chat history.
Before running the application, make sure you have the following installed:
- Python 3.9+
- pip for managing Python packages
- Streamlit for the user interface
- LangChain, HuggingFaceEmbeddings, FAISS, LlamaCpp, PyPDFLoader, python-docx, and python-pptx for document processing and the chatbot pipeline.
- Clone the repository:

  ```shell
  git clone https://github.com/your-username/your-repository.git
  cd your-repository
  ```

- Create a virtual environment (optional but recommended):

  ```shell
  python -m venv venv
  source venv/bin/activate    # On macOS/Linux
  .\venv\Scripts\activate     # On Windows
  ```

- Install the dependencies:

  ```shell
  pip install -r requirements.txt
  ```
The main modules used in this project are:
- Streamlit: For creating the interactive user interface.
- LangChain: For managing the conversations with the AI.
- FAISS: For efficient vector search.
- HuggingFaceEmbeddings: For creating embeddings from the text.
- LlamaCpp: For the LLM model used in the chatbot.
- PyPDFLoader: For extracting text from PDF files.
- python-docx: For handling DOCX files.
- python-pptx: For extracting text from PPTX files.
You can install them manually or via the `requirements.txt` file.
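If you install the modules manually, a minimal `requirements.txt` could look like the following. The pip package names are assumptions inferred from the modules listed above (HuggingFaceEmbeddings is provided through sentence-transformers, and PyPDFLoader reads PDFs through pypdf); treat the project's own `requirements.txt` as authoritative:

```
streamlit
langchain
faiss-cpu
llama-cpp-python
sentence-transformers
pypdf
python-docx
python-pptx
```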
- Run the Streamlit application:

  ```shell
  streamlit run app.py
  ```

- Open your browser and go to http://localhost:8501.
- Upload one or more files (PDF, DOCX, PPTX) from the sidebar.
- Ask a question in the text box; the chatbot will respond based on the content of the uploaded files.
- Download the conversation using the provided button.
- File Upload: The user uploads PDF, DOCX, or PPTX files.
- Text Extraction: Text is extracted using PyPDFLoader, python-docx, and python-pptx.
- Vectorization: The text is split into chunks and transformed into embeddings using HuggingFaceEmbeddings.
- Contextual Search: Embeddings are stored in a FAISS index for fast retrieval.
- Conversational Chatbot: LlamaCpp is used to generate contextual responses based on user queries.
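The chunk-embed-retrieve pipeline above can be sketched in miniature. This toy version substitutes a bag-of-words count vector for the real HuggingFaceEmbeddings model and a brute-force similarity scan for the FAISS index, so it runs with the standard library alone; it illustrates the idea, not the app's actual implementation:

```python
import math
from collections import Counter

def split_into_chunks(text: str, chunk_size: int = 40) -> list[str]:
    """Split extracted text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def embed(text: str) -> Counter:
    """Toy embedding: a word-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query (FAISS's role)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

document = ("FAISS builds a vector index for fast similarity search. "
            "Streamlit renders the chat interface in the browser.")
chunks = split_into_chunks(document, chunk_size=8)
print(retrieve("How does vector search work?", chunks))
```

In the real app, the retrieved chunks are passed to LlamaCpp as context so the model can answer from the documents rather than from its training data alone.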
The chatbot uses the model `mistral-7b-instruct-v0.1.Q4_K_M.gguf` with LlamaCpp, and the model file is required for the chatbot to work. Download it from Hugging Face, place it in your project directory, and update the model path in the code:

```python
model_path = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
```
You can download the model mistral-7b-instruct-v0.1.Q4_K_M.gguf from the following Hugging Face link:
Once downloaded, place the model file in the project directory and update the `model_path` in the code accordingly:

```python
model_path = "path/to/your/mistral-7b-instruct-v0.1.Q4_K_M.gguf"
```
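Since a missing model file is the most common setup error, a small guard like the following can fail fast with a clear message. This helper is a suggestion, not part of the original code; `model_path` mirrors the variable shown above:

```python
from pathlib import Path

model_path = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"

def model_is_ready(path: str) -> bool:
    """Return True if a .gguf model file exists at `path`."""
    p = Path(path)
    return p.is_file() and p.suffix == ".gguf"

# Check before constructing the LlamaCpp instance.
if not model_is_ready(model_path):
    print(f"Model '{model_path}' not found; download it from "
          "Hugging Face and place it in the project directory.")
```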