- 1. Problem
- 2. How We Try to Solve It
- 3. How It Works
- 4. Local Setup
- 5. Code Structure
- 6. Evaluation Metrics
- 7. Contribution Guide
- 8. Acknowledgements
Academic institutions like BITS Pilani maintain extensive research regulations and guidelines distributed across multiple PDF documents. This creates several critical challenges:
- Information Fragmentation: Research guidelines are scattered across multiple documents, making it difficult to find specific information quickly
- Manual Search Overhead: Significant time is wasted manually searching through lengthy documents
- Lack of Explainability: Answers need clear source attribution so that users can verify them against the official documents
- Accessibility Barrier: Documents are not easily searchable or accessible in a user-friendly format
Our solution implements a Search and Answering system specifically designed for BITS research regulations, spanning hundreds of documents, using:
- Document Processing Pipeline: Automated system to process and structure official BITS PDF documents
- Hybrid Search Architecture: Combines semantic understanding with keyword matching (sparse embeddings) for accurate retrieval
- RAG (Retrieval Augmented Generation): Uses large language models (LLMs) to generate natural, contextual answers
- Source Attribution: Every answer includes links to source documents for verification
- User-Friendly Interface: Clean, modern web interface for asking questions and viewing answers
The system operates through three main stages:
- Document Processing: PDF ingestion → Text extraction → Cleaning → Chunking → Embedding generation → Indexing and storage.
- Information Retrieval: Query processing → Hybrid search (dense + sparse retrieval) → Reranking → Context selection
- Answer Generation: Context merging → Answer generation using an LLM (e.g., gpt-4-turbo) → Source attribution → Response formatting
Document Sources:
- Official BITS Pilani website (AGSRD, PhD guidelines, etc.)
Collection Process:
- Automated PDF downloading from official sources
- Document validation and metadata extraction
Text Extraction:
- PyMuPDF for robust PDF parsing (see the extraction sketch after this list)
- Table and figure handling
- Layout preservation where relevant
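A minimal page-level extraction sketch with PyMuPDF (imported as `fitz`); the file name and record layout are illustrative, not the exact code in document_processing_hybrid.py:

```python
import fitz  # PyMuPDF

def extract_pages(pdf_path: str) -> list[dict]:
    """Return one record per page, keeping the page number for later source attribution."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pages.append({
                "page_number": page.number + 1,  # PyMuPDF pages are 0-indexed
                "text": page.get_text("text"),   # plain-text extraction mode
            })
    return pages

pages = extract_pages("phd_guidelines.pdf")  # hypothetical file name
```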
Cleaning Pipeline (the steps below are sketched in code after this list):
- Unicode normalization
- Special character handling
- Whitespace normalization
- Header/footer removal
- Bullet point standardization
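A hedged sketch of what such a cleaning pass can look like; the exact rules (for example, how headers and footers are detected) are project-specific and not reproduced here:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Apply simple normalization steps before chunking."""
    text = unicodedata.normalize("NFKC", raw)   # Unicode normalization
    text = text.replace("\u2022", "- ")          # standardize bullet characters
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # collapse long blank-line runs
    return text.strip()
```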
Chunk Creation:
- Recursive text splitter for creating chunks of text from the documents (see the chunking sketch after this list)
- Chunk overlap to preserve context across chunk boundaries
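An illustrative chunking call using LangChain's RecursiveCharacterTextSplitter; whether the project uses this splitter or a hand-rolled one is an assumption, and the size/overlap values are examples rather than the project's tuned settings:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk (example value)
    chunk_overlap=200,  # overlapping characters preserve context across chunks
)
chunks = splitter.split_text(cleaned_text)  # cleaned_text: output of the cleaning step
```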
Metadata:
- Document name
- Page numbers
- Document source/link (used later as the reference in the final generated answer)
Dense Embeddings:
- Model: OpenAI's text-embedding-3-large (see the sketch after this list)
- Vector Size: 1536 dimensions
- Purpose: Capture semantic meaning and contextual understanding
- Similarity Metric: Cosine similarity between query and document vectors
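A sketch of dense embedding generation with the OpenAI Python client. Passing dimensions=1536 matches the vector size listed above; text-embedding-3-large natively outputs 3072 dimensions and supports shortened embeddings via this parameter:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment / .env

def dense_embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1536,  # shorten the native 3072-dim output to match the index size
    )
    return response.data[0].embedding
```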
Sparse Embeddings:
- Model: SPLADE++ (prithivida/Splade_PP_en_v1) from Hugging Face (see the sketch after this list)
- Vector Type: High-dimensional sparse vectors.
- Purpose: Exact term matching and vocabulary expansion
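One way to generate SPLADE++ sparse vectors is through the fastembed library, which ships this model; whether the project calls fastembed or loads the Hugging Face model directly is an assumption:

```python
from fastembed import SparseTextEmbedding

sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")
sparse_vec = next(iter(sparse_model.embed(["PhD thesis submission guidelines"])))
# SPLADE output is stored as (token index, weight) pairs rather than a dense array
print(list(zip(sparse_vec.indices[:5], sparse_vec.values[:5])))
```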
Query Processing:
- Query analysis and preprocessing
- Dense and sparse embedding generation
Hybrid Search:
- Parallel dense and sparse vector search
- Score fusion using Reciprocal Rank Fusion (see the sketch after this list)
- Initial candidate set generation (top-k)
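Reciprocal Rank Fusion can be written in a few lines. The constant k = 60 is the value commonly used in the RRF literature; whether fusion happens in application code or inside Qdrant is an implementation detail not claimed here:

```python
def reciprocal_rank_fusion(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk ids; items ranked high in both lists win."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion(["c3", "c1", "c7"], ["c1", "c9", "c3"])  # c1 and c3 lead
```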
Reranking:
- Model: Cross-encoder reranking with ms-marco-MiniLM-L-6-v2 (see the sketch below)
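A reranking sketch with the sentence-transformers CrossEncoder wrapper; cross-encoder/ms-marco-MiniLM-L-6-v2 is the standard Hugging Face id for this model, and the candidate format below is illustrative:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidate_texts: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, passage) pair and keep the highest-scoring passages."""
    scores = reranker.predict([(query, text) for text in candidate_texts])
    ranked = sorted(zip(scores, candidate_texts), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_n]]
```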
Context Processing:
- Merging relevant chunks
LLM Integration:
- Integration with an LLM such as gpt-4-turbo (see the sketch after this list)
- Custom system prompt that guides the LLM to generate answers grounded in the retrieved context
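A sketch of the generation call; the system prompt wording below is illustrative, not the project's actual prompt:

```python
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = (
    "You answer questions about BITS research regulations using only the provided "
    "context. Cite the source document for each claim, and say so explicitly if the "
    "context does not contain the answer."
)

def generate_answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```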
Response Formatting:
- Answer structuring
- Source attribution addition
- Error handling
Fork the repository:
# Visit https://github.com/HarshJ23/CS_F469_Information_retreival_Project
# Click on 'Fork' button at top-right
Clone your forked repository:
git clone https://github.com/YOUR-USERNAME/CS_F469_Information_retreival_Project.git
cd CS_F469_Information_retreival_Project
- Python Environment:
python -m venv venv
.\venv\Scripts\activate   # Windows
source venv/bin/activate  # macOS/Linux
pip install -r requirements.txt
OpenAI API Setup:
- Create an account on OpenAI Platform
- Navigate to API Keys section
- Click 'Create new secret key'
- Copy the generated API key
- Add to the .env file: OPENAI_API_KEY=your_key_here
Qdrant Setup:
- Create an account on Qdrant Cloud
- Click 'Create cluster' and select free tier
- After cluster creation, a dialog will appear with credentials
- Copy the API key from the dialog
- For the URL:
- Extract your cluster ID from the dashboard URL (e.g., https://cloud.qdrant.io/accounts/<cluster-id-string>/get-started)
- Format the Qdrant URL as: https://<your-cluster-id>.eu-west-2-0.aws.cloud.qdrant.io:6333
- Add to the .env file:
QDRANT_URL=https://<your-cluster-id>.eu-west-2-0.aws.cloud.qdrant.io:6333
QDRANT_API_KEY=your_qdrant_api_key
Database Setup:
# Initialize Qdrant collection
python document_processing_hybrid.py
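Roughly, this initialization step creates a Qdrant collection that holds both dense and sparse vectors. The collection and vector names below are assumptions; see document_processing_hybrid.py for the actual configuration:

```python
import os
from qdrant_client import QdrantClient, models

client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])
client.create_collection(
    collection_name="bits_regulations",  # hypothetical collection name
    vectors_config={"dense": models.VectorParams(size=1536, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)
```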
- Start Server:
uvicorn main_api:app --reload --port 8000
- Node.js Environment:
cd frontend
npm install
- Environment Variables:
# Create .env.local in frontend directory
NEXT_PUBLIC_API_URL=http://localhost:8000
- Development Server:
npm run dev
CS_F469_Information_retreival_Project/
├── backend/
│ ├── main_api.py # FastAPI application
│ ├── document_processing_hybrid.py # Document processing
│ ├── query_handler_hybrid.py # Query processing
│ ├── query_handler_hybrid_rerank.py # Reranking logic
│ └── evaluate_api.py # Evaluation scripts
├── frontend/
│ ├── app/ # Next.js pages
│ ├── components/ # React components
│ └── public/ # Static assets
└── docs/ # Documentation
Data Flow:
- User submits question → Frontend
- Frontend sends API request → Backend
- Backend processes query (see the endpoint sketch after this list):
- Generates embeddings
- Performs hybrid search
- Reranks results
- Generates answer
- Response returned → Frontend
- Frontend renders formatted answer
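The backend half of this flow maps roughly onto a single FastAPI endpoint. The route name, request model, and helper functions below (embed_query, hybrid_search, rerank, generate_answer) are hypothetical stand-ins for the logic in main_api.py and the query handlers:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

@app.post("/query")  # hypothetical route; see main_api.py for the real one
def answer_query(req: QueryRequest) -> dict:
    dense_vec, sparse_vec = embed_query(req.question)   # dense + sparse embeddings
    candidates = hybrid_search(dense_vec, sparse_vec)    # Qdrant search + rank fusion
    top_chunks = rerank(req.question, candidates)        # cross-encoder reranking
    context = "\n\n".join(chunk["text"] for chunk in top_chunks)
    answer = generate_answer(req.question, context)      # LLM answer generation
    return {"answer": answer, "sources": [chunk["source"] for chunk in top_chunks]}
```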
We evaluate the quality of generated answers using three complementary metrics that assess different aspects of text similarity:
BLEU:
- What it measures: Precision-focused metric that compares n-gram overlap between generated and reference answers (see the example after this list)
- How it works:
- Counts matching n-grams (1-4 words) between generated and reference text
- Applies brevity penalty for short answers
- Combines scores from different n-gram lengths
- Score range: 0 to 1 (higher is better)
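A small example of computing a sentence-level BLEU score; NLTK is used here for illustration, and whether evaluate_api.py uses NLTK or another implementation is an assumption (the sentences are made up):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Candidates must submit the thesis after completing the required coursework.".split()
generated = "The thesis can be submitted once the required coursework is complete.".split()

# Smoothing avoids zero scores when some higher-order n-grams have no matches
bleu = sentence_bleu([reference], generated, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")  # range 0-1, higher is better
```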
ROUGE-L:
- What it measures: Identifies the longest sequence of matching words, allowing for gaps (see the example after this list)
- How it works:
- Finds longest common subsequence between texts
- Calculates precision (generated text accuracy)
- Calculates recall (reference text coverage)
- Combines into F1 score
- Advantages:
- More flexible than BLEU for word order
- Better at handling paraphrasing
- Captures sentence structure similarity
- Score range: 0 to 1 (higher is better)
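A ROUGE-L example with the rouge-score package (an assumption about the implementation; the metric itself is as described above):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    "Candidates must submit the thesis after completing the required coursework.",  # reference
    "The thesis can be submitted once the required coursework is complete.",        # generated
)
print(scores["rougeL"].fmeasure)  # F1 over the longest common subsequence, range 0-1
```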
BERTScore:
- What it measures: Semantic similarity using contextual embeddings (see the example after this list)
- How it works:
- Generates BERT embeddings for each word
- Computes cosine similarity between words
- Finds optimal word alignments
- Calculates precision and recall using soft token matching
- Advantages:
- Captures semantic meaning beyond exact matches
- Handles synonyms and paraphrasing well
- Correlates better with human judgments
- Score range: 0 to 1 (higher is better)
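A BERTScore example with the bert-score package; the underlying model defaults to a RoBERTa variant unless specified, which is a library default rather than a claim about the project's exact configuration:

```python
from bert_score import score

P, R, F1 = score(
    ["The thesis can be submitted once the required coursework is complete."],       # generated
    ["Candidates must submit the thesis after completing the required coursework."],  # reference
    lang="en",
)
print(F1.mean().item())  # higher means closer semantic match between answers
```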
- Fork the repository
- Create a feature branch
- Set up development environment
- Make changes
- Submit pull request
Documentation:
- Update README if needed
- Document new functions/classes
- Add inline comments for complex logic
Version Control:
- Clear commit messages
- One feature per branch
- Rebase before PR
This project was developed as part of the problem statement given in the course CS F469 Information Retrieval, taught by Prof. Prajna Upadhyay at BITS Pilani Hyderabad Campus.
Team member: Mehul Kochar.