- 1. Problem
- 2. How We Try to Solve It
- 3. How It Works
- 4. Local Setup
- 5. Code Structure
- 6. Evaluation Metrics
- 7. Contribution Guide
- 8. Acknowledgements
Academic institutions like BITS Pilani maintain extensive research regulations and guidelines distributed across multiple PDF documents. This creates several critical challenges:
- Information Fragmentation: Research guidelines are scattered across multiple documents, making it difficult to find specific information quickly
- Manual Search Overhead: Significant time is wasted manually searching through lengthy documents
- Lack of Explainability: Answers need clear source attribution so that users can verify them against the official documents
- Accessibility Barrier: Documents are not easily searchable or accessible in a user-friendly format
Our solution implements a Search and Answering system specifically designed for BITS research regulations, spanning hundreds of documents, using:
- Document Processing Pipeline: Automated system to process and structure official BITS PDF documents
- Hybrid Search Architecture: Combines semantic understanding with keyword matching (sparse embeddings) for accurate retrieval
- RAG (Retrieval Augmented Generation): Uses large language models (LLMs) to generate natural, contextual answers
- Source Attribution: Every answer includes links to source documents for verification
- User-Friendly Interface: Clean, modern web interface for asking questions and viewing answers
The system operates through three main stages:
- Document Processing: PDF ingestion → Text extraction → Cleaning → Chunking → Embedding generation → Indexing and storage.
- Information Retrieval: Query processing → Hybrid search (dense + sparse retrieval) → Reranking → Context selection
- Answer Generation: Context merging → Answer generation using an LLM (e.g., gpt-4-turbo) → Source attribution → Response formatting
Document Sources:
- Official BITS Pilani website (AGSRD, PhD guidelines, etc.)
Collection Process:
- Automated PDF downloading from official sources
- Document validation and metadata extraction
Text Extraction:
- PyMuPDF for robust PDF parsing (see the extraction sketch after this list)
- Table and figure handling
- Layout preservation where relevant
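A minimal page-level extraction sketch with PyMuPDF (imported as `fitz`); the file name and record layout are illustrative, not the exact code in document_processing_hybrid.py:

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path: str) -> list[dict]:
    """Return one record per page, keeping the page number for later source attribution."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pages.append({
                "page_number": page.number + 1,  # PyMuPDF pages are 0-indexed
                "text": page.get_text("text"),   # plain-text extraction mode
            })
    return pages

pages = extract_pages("phd_guidelines.pdf")  # hypothetical file name
```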
Cleaning Pipeline (the steps below are sketched in code after this list):
- Unicode normalization
- Special character handling
- Whitespace normalization
- Header/footer removal
- Bullet point standardization
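A hedged sketch of what such a cleaning pass can look like; the exact rules (for example, how headers and footers are detected) are project-specific and not reproduced here:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Apply simple normalization steps before chunking."""
    text = unicodedata.normalize("NFKC", raw)   # Unicode normalization
    text = text.replace("\u2022", "- ")          # standardize bullet characters
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # collapse long blank-line runs
    return text.strip()
```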
Chunk Creation:
- Recursive text splitter for creating chunks of text from the documents (see the chunking sketch after this list)
- Chunk overlap to preserve context across chunk boundaries
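An illustrative chunking call using LangChain's RecursiveCharacterTextSplitter; whether the project uses this splitter or a hand-rolled one is an assumption, and the size/overlap values are examples rather than the project's tuned settings:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk (example value)
    chunk_overlap=200,  # overlapping characters preserve context across chunks
)
chunks = splitter.split_text(cleaned_text)  # cleaned_text: output of the cleaning step
```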
Metadata:
- Document name
- Page numbers
- Document source/link (used later as the reference in the final generated answer)
Dense Embeddings:
- Model: OpenAI's text-embedding-3-large (see the sketch after this list)
- Vector Size: 1536 dimensions
- Purpose: Capture semantic meaning and contextual understanding
- Similarity Metric: Cosine similarity between query and document vectors
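A sketch of dense embedding generation with the OpenAI Python client. Passing dimensions=1536 matches the vector size listed above; text-embedding-3-large natively outputs 3072 dimensions and supports shortened embeddings via this parameter:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment / .env

def dense_embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1536,  # shorten the native 3072-dim output to match the index size
    )
    return response.data[0].embedding
```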
Sparse Embeddings:
- Model: SPLADE++ (prithivida/Splade_PP_en_v1) from Hugging Face (see the sketch after this list)
- Vector Type: High-dimensional sparse vectors.
- Purpose: Exact term matching and vocabulary expansion
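One way to generate SPLADE++ sparse vectors is through the fastembed library, which ships this model; whether the project calls fastembed or loads the Hugging Face model directly is an assumption:

```python
from fastembed import SparseTextEmbedding

sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
sparse_vec = next(iter(sparse_model.embed(["PhD thesis submission guidelines"])))
# SPLADE output is stored as (token index, weight) pairs rather than a dense array
print(list(zip(sparse_vec.indices[:5], sparse_vec.values[:5])))
```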
Query Processing:
- Query analysis and preprocessing
- Dense and sparse embedding generation
Hybrid Search:
- Parallel dense and sparse vector search
- Score fusion using Reciprocal Rank Fusion (see the sketch after this list)
- Initial candidate set generation (top-k)
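Reciprocal Rank Fusion can be written in a few lines. The constant k = 60 is the value commonly used in the RRF literature; whether fusion happens in application code or inside Qdrant is an implementation detail not claimed here:

```python
def reciprocal_rank_fusion(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk ids; items ranked high in both lists win."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion(["c3", "c1", "c7"], ["c1", "c9", "c3"])  # c1 and c3 lead
```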
Reranking:
- Model: Cross-encoder reranking with ms-marco-MiniLM-L-6-v2 (see the sketch below)
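A reranking sketch with the sentence-transformers CrossEncoder wrapper; cross-encoder/ms-marco-MiniLM-L-6-v2 is the standard Hugging Face id for this model, and the candidate format below is illustrative:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidate_texts: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, passage) pair and keep the highest-scoring passages."""
    scores = reranker.predict([(query, text) for text in candidate_texts])
    ranked = sorted(zip(scores, candidate_texts), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_n]]
```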
Context Processing:
- Merging relevant chunks
LLM Integration:
- Integration with an LLM such as gpt-4-turbo (see the sketch after this list)
- Custom system prompt that guides the LLM to generate answers grounded in the retrieved context
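A sketch of the generation call; the system prompt wording below is illustrative, not the project's actual prompt:

```python
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = (
    "You answer questions about BITS research regulations using only the provided "
    "context. Cite the source document for each claim, and say so explicitly if the "
    "context does not contain the answer."
)

def generate_answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```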
Response Formatting:
- Answer structuring
- Source attribution addition
- Error handling
Fork the repository:
# Visit https://github.com/HarshJ23/CS_F469_Information_retreival_Project
# Click on 'Fork' button at top-right
Clone your forked repository:
git clone https://github.com/YOUR-USERNAME/CS_F469_Information_retreival_Project.git
cd CS_F469_Information_retreival_Project
- Python Environment:
python -m venv venv
.\venv\Scripts\activate   # Windows
source venv/bin/activate  # macOS/Linux
pip install -r requirements.txt
OpenAI API Setup:
- Create an account on OpenAI Platform
- Navigate to API Keys section
- Click 'Create new secret key'
- Copy the generated API key
- Add to the .env file: OPENAI_API_KEY=your_key_here
Qdrant Setup:
- Create an account on Qdrant Cloud
- Click 'Create cluster' and select free tier
- After cluster creation, a dialog will appear with credentials
- Copy the API key from the dialog
- For the URL:
- Extract your cluster ID from the dashboard URL (e.g., https://cloud.qdrant.io/accounts/<cluster-id-string>/get-started)
- Format the Qdrant URL as: https://<your-cluster-id>.eu-west-2-0.aws.cloud.qdrant.io:6333
- Add to the .env file:
QDRANT_URL=https://<your-cluster-id>.eu-west-2-0.aws.cloud.qdrant.io:6333
QDRANT_API_KEY=your_qdrant_api_key
Database Setup:
# Initialize Qdrant collection
python document_processing_hybrid.py
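Roughly, this initialization step creates a Qdrant collection that holds both dense and sparse vectors. The collection and vector names below are assumptions; see document_processing_hybrid.py for the actual configuration:

```python
import os
from qdrant_client import QdrantClient, models

client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])
client.create_collection(
    collection_name="bits_regulations",  # hypothetical collection name
    vectors_config={"dense": models.VectorParams(size=1536, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)
```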
- Start Server:
uvicorn main_api:app --reload --port 8000
- Node.js Environment:
cd frontend
npm install
- Environment Variables:
# Create .env.local in frontend directory
NEXT_PUBLIC_API_URL=http://localhost:8000
- Development Server:
npm run dev
CS_F469_Information_retreival_Project/
├── backend/
│ ├── main_api.py # FastAPI application
│ ├── document_processing_hybrid.py # Document processing
│ ├── query_handler_hybrid.py # Query processing
│ ├── query_handler_hybrid_rerank.py # Reranking logic
│ └── evaluate_api.py # Evaluation scripts
├── frontend/
│ ├── app/ # Next.js pages
│ ├── components/ # React components
│ └── public/ # Static assets
└── docs/ # Documentation
Data Flow:
- User submits question → Frontend
- Frontend sends API request → Backend
- Backend processes query (see the endpoint sketch after this list):
- Generates embeddings
- Performs hybrid search
- Reranks results
- Generates answer
- Response returned → Frontend
- Frontend renders formatted answer
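The backend half of this flow maps roughly onto a single FastAPI endpoint. The route name, request model, and helper functions below (embed_query, hybrid_search, rerank, generate_answer) are hypothetical stand-ins for the logic in main_api.py and the query handlers:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

@app.post("/query")  # hypothetical route; see main_api.py for the real one
def answer_query(req: QueryRequest) -> dict:
    dense_vec, sparse_vec = embed_query(req.question)   # dense + sparse embeddings
    candidates = hybrid_search(dense_vec, sparse_vec)    # Qdrant search + rank fusion
    top_chunks = rerank(req.question, candidates)        # cross-encoder reranking
    context = "\n\n".join(chunk["text"] for chunk in top_chunks)
    answer = generate_answer(req.question, context)      # LLM answer generation
    return {"answer": answer, "sources": [chunk["source"] for chunk in top_chunks]}
```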
We evaluate the quality of generated answers using three complementary metrics that assess different aspects of text similarity:
BLEU:
- What it measures: Precision-focused metric that compares n-gram overlap between generated and reference answers (see the example after this list)
- How it works:
- Counts matching n-grams (1-4 words) between generated and reference text
- Applies brevity penalty for short answers
- Combines scores from different n-gram lengths
- Score range: 0 to 1 (higher is better)
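A small example of computing a sentence-level BLEU score; NLTK is used here for illustration, and whether evaluate_api.py uses NLTK or another implementation is an assumption (the sentences are made up):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Candidates must submit the thesis after completing the required coursework.".split()
generated = "The thesis can be submitted once the required coursework is complete.".split()

# Smoothing avoids zero scores when some higher-order n-grams have no matches
bleu = sentence_bleu([reference], generated, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")  # range 0-1, higher is better
```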
ROUGE-L:
- What it measures: Identifies the longest sequence of matching words, allowing for gaps (see the example after this list)
- How it works:
- Finds longest common subsequence between texts
- Calculates precision (generated text accuracy)
- Calculates recall (reference text coverage)
- Combines into F1 score
- Advantages:
- More flexible than BLEU for word order
- Better at handling paraphrasing
- Captures sentence structure similarity
- Score range: 0 to 1 (higher is better)
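A ROUGE-L example with the rouge-score package (an assumption about the implementation; the metric itself is as described above):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    "Candidates must submit the thesis after completing the required coursework.",  # reference
    "The thesis can be submitted once the required coursework is complete.",        # generated
)
print(scores["rougeL"].fmeasure)  # F1 over the longest common subsequence, range 0-1
```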
BERTScore:
- What it measures: Semantic similarity using contextual embeddings (see the example after this list)
- How it works:
- Generates BERT embeddings for each word
- Computes cosine similarity between words
- Finds optimal word alignments
- Calculates precision and recall using soft token matching
- Advantages:
- Captures semantic meaning beyond exact matches
- Handles synonyms and paraphrasing well
- Correlates better with human judgments
- Score range: 0 to 1 (higher is better)
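A BERTScore example with the bert-score package; the underlying model defaults to a RoBERTa variant unless specified, which is a library default rather than a claim about the project's exact configuration:

```python
from bert_score import score

P, R, F1 = score(
    ["The thesis can be submitted once the required coursework is complete."],       # generated
    ["Candidates must submit the thesis after completing the required coursework."],  # reference
    lang="en",
)
print(F1.mean().item())  # higher means closer semantic match between answers
```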
- Fork the repository
- Create a feature branch
- Set up development environment
- Make changes
- Submit pull request
Documentation:
- Update README if needed
- Document new functions/classes
- Add inline comments for complex logic
Version Control:
- Clear commit messages
- One feature per branch
- Rebase before PR
This project was developed as part of the problem statement given in the course CS F469 Information Retrieval, taught by Prof. Prajna Upadhyay at BITS Pilani Hyderabad Campus.
Team member: Mehul Kochar.