An AI-powered document processing and conversational search system that transforms multilingual documents into intelligent, searchable knowledge bases using autonomous agents, hybrid vector search, and large language models.
- Interactive Q&A powered by Gemini 2.5 Flash with context-aware responses
- Hybrid Search combining dense (Vertex AI) + sparse (SPLADE) embeddings
- Smart Retrieval with Reciprocal Rank Fusion (RRF) for optimal results
- Metadata Intelligence leveraging ETL processing provenance for result ranking
- Document Analysis Agent: Intelligent content analysis and tool selection
- Layout-Aware Block Extractor: Multi-tool extraction (PyMuPDF, Camelot, OCR)
- Semantic Chunking Agent: Content-aware chunking with quality scoring
- Embedding Agent: Hybrid vector generation (dense + sparse)
- Indexing Agent: Dual storage (Qdrant vector DB + Neo4j graph DB)
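The five-stage flow above can be sketched as a sequential, state-passing pipeline. This is a pure-Python illustration with placeholder stage bodies, not the real LangGraph implementation in `src/doc_pipeline/ingestion_graph/`:

```python
# Minimal sketch of the 5-agent ETL flow as sequential state transforms.
# Stage names mirror the agents above; the bodies are placeholders.

def analyze(state):   # Document Analysis Agent: pick an extraction strategy
    state["strategy"] = "layout_aware"
    return state

def extract(state):   # Layout-Aware Block Extractor: PDF -> blocks
    state["blocks"] = [f"block from {state['pdf']}"]
    return state

def chunk(state):     # Semantic Chunking Agent: blocks -> scored chunks
    state["chunks"] = [{"text": b, "quality": 0.9} for b in state["blocks"]]
    return state

def embed(state):     # Embedding Agent: chunks -> dense + sparse vectors
    state["vectors"] = [{"dense": [0.0] * 3, "sparse": {7: 0.5}}
                        for _ in state["chunks"]]
    return state

def index(state):     # Indexing Agent: write vectors to Qdrant + Neo4j
    state["indexed"] = len(state["vectors"])
    return state

PIPELINE = [analyze, extract, chunk, embed, index]

def run_pipeline(pdf_path):
    state = {"pdf": pdf_path}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

In the real pipeline each stage is a LangGraph node and the dict is a typed state object; the shape of the flow is the same.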
- Dense Vectors: 768D multilingual embeddings via Vertex AI
- Sparse Vectors: SPLADE-based keyword matching
- Vector Database: Qdrant Cloud with named vectors
- Knowledge Graph: Neo4j Aura with entity relationships
- Fusion: RRF combining semantic + keyword relevance
- 18 Languages: Supported by text-multilingual-embedding-002
- OCR Integration: Tesseract for text extraction from images
- Document AI: Google Document AI for complex layouts
- Table Extraction: Camelot for precise table processing
- Phoenix Tracing: End-to-end pipeline monitoring
- Agent Metadata: Complete processing provenance
- Performance Metrics: Response times, confidence scores, quality indicators
- OpenTelemetry: Industry-standard instrumentation
```mermaid
graph TB
    %% Input Layer
    PDF[PDF Documents] --> DIA[Document Analysis Agent]

    %% Processing Pipeline
    DIA --> |Strategy Decision| LABE[Layout-Aware Block Extractor]
    LABE --> |Blocks| SCA[Semantic Chunking Agent]
    SCA --> |Chunks| EA[Embedding Agent]
    EA --> |Vectors| IA[Indexing Agent]

    %% Storage Layer
    IA --> QD[Qdrant Vector DB<br/>Dense + Sparse Vectors]
    IA --> NEO[Neo4j Graph DB<br/>Entities + Relations]

    %% Query Interface
    USER[User Query] --> CA[Chat Agent]
    CA --> |Embed Query| EA2[Query Embedder]
    EA2 --> |Dense + Sparse| HR[Hybrid Retriever]

    %% Retrieval Layer
    HR --> |Vector Search| QD
    HR --> |Graph Traversal| NEO
    HR --> |RRF Fusion| RR[Ranked Results]
    RR --> CA
    CA --> |Response| USER

    %% External Services
    GEMINI[Gemini 2.5 Flash] --> CA
    VERTEX[Vertex AI Embeddings] --> EA
    VERTEX --> EA2
    SPLADE[SPLADE Sparse] --> EA
    SPLADE --> EA2
    PHOENIX[Phoenix Observability] --> ALL[All Components]

    %% Agent Details
    subgraph "5-Agent ETL Pipeline"
        DIA
        LABE
        SCA
        EA
        IA
    end

    subgraph "Hybrid Storage"
        QD
        NEO
    end

    subgraph "Intelligent Retrieval"
        HR
        RR
    end

    style DIA fill:#e1f5fe
    style LABE fill:#e8f5e8
    style SCA fill:#fff3e0
    style EA fill:#f3e5f5
    style IA fill:#fce4ec
    style QD fill:#e0f2f1
    style NEO fill:#e8eaf6
    style CA fill:#fff8e1
    style HR fill:#f1f8e9
```
```bash
# Clone the repository
git clone https://github.com/your-username/Document_RAG.git
cd Document_RAG

# Create the conda environment
conda env create -f environment.yml
conda activate docparser-env

# Install dependencies
pip install -r requirements.txt
```

```bash
# Copy the configuration template
cp config/config_sample.json config/config.json

# Configure your services in config/config.json:
# - Google Cloud credentials and project ID
# - Qdrant Cloud URL and API key
# - Neo4j Aura connection details
# - Phoenix API key for observability
```
Required Services:
- Google Cloud: Vertex AI, Document AI (optional)
- Qdrant Cloud: Vector database with hybrid vectors
- Neo4j Aura: Graph database for entity relationships
- Phoenix: Observability and tracing (optional)
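A quick pre-flight check along these lines can catch missing credentials before the pipeline starts. The key names below are illustrative, not the exact schema of `config_sample.json`:

```python
# Validate that a loaded config dict contains the service settings the
# pipeline needs. The key names here are assumptions for illustration.
REQUIRED_KEYS = {
    "google_cloud_project",
    "qdrant_url", "qdrant_api_key",
    "neo4j_uri", "neo4j_user", "neo4j_password",
}

def missing_keys(config: dict) -> set:
    """Return the required keys absent from the config."""
    return REQUIRED_KEYS - config.keys()
```

Load `config/config.json` with `json.load`, pass the dict in, and fail fast if the returned set is non-empty.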
```bash
# Process a folder of PDFs (recommended for a first run)
python scripts/process_folder_langgraph.py --folder SampleDataSet/SampleDataSet/

# Process an individual document
python scripts/process_sample_dataset.py

# Reset databases (if needed)
python scripts/reset_databases.py
```

```bash
# Start the AI chat interface
python scripts/chat_demo.py

# Example queries:
# "What revenue was mentioned in Q3 2022?"
# "Show me tables with high processing confidence"
# "How are entities connected in the knowledge graph?"
```
```text
Document_RAG/
├── README.md                        # This file
├── config/
│   ├── config_sample.json           # Configuration template
│   └── *.json                       # Service credentials
├── requirements.txt                 # Python dependencies
├── environment.yml                  # Conda environment
├── docker-compose.yml               # Local development services
├── SampleDataSet/                   # Test documents (PDFs)
├── scripts/                         # Execution scripts
│   ├── chat_demo.py                 # Interactive chat interface
│   ├── process_folder_langgraph.py  # Batch document processing
│   ├── process_sample_dataset.py    # Single document processing
│   └── reset_databases.py           # Database cleanup
├── tests/                           # Test suites
│   ├── integration/                 # End-to-end tests
│   └── unit/                        # Component tests
├── docs/                            # Documentation
│   ├── ARCHITECTURE_MIGRATION.md    # Technical architecture
│   └── AUTOGEN_CHAT_AGENT.md        # Chat agent details
└── src/doc_pipeline/                # Core pipeline code
    ├── chat_agent/                  # Conversational interface
    │   ├── agents.py                # AutoGen chat agents
    │   ├── retrieval_tools.py       # Hybrid search implementation
    │   └── query_processor.py       # Query analysis
    ├── ingestion_graph/             # LangGraph ETL pipeline
    │   ├── agents/                  # Autonomous processing agents
    │   ├── graph.py                 # Pipeline orchestration
    │   ├── nodes.py                 # Processing functions
    │   └── state.py                 # Pipeline state management
    ├── docparser/                   # Document parsing tools
    ├── chunking/                    # Text chunking strategies
    ├── embeddings/                  # Vector generation
    ├── graphdb/                     # Neo4j integration
    └── observability/               # Phoenix tracing
```
```bash
# Process all PDFs in a folder with the full pipeline
python scripts/process_folder_langgraph.py --folder /path/to/pdfs

# Process with a custom limit
python scripts/process_folder_langgraph.py --folder /path/to/pdfs --limit 10

# View processing statistics
python scripts/process_folder_langgraph.py --folder /path/to/pdfs --verbose
```

```bash
# Start the interactive chat
python scripts/chat_demo.py

# Example queries:
# "What was Apple's revenue in Q3 2023?"
# "Show me energy certificates from Germany"
# "Find invoices with amounts over $1000"
# "What documents were processed by the Document AI agent?"
```

```bash
# Reset both databases
python scripts/reset_databases.py

# Test cloud connections
python tests/integration/test_cloud_connections.py

# Verify service health
python tests/integration/test_basic_connections.py
```
Dense Vectors (768D)
- Model: `text-multilingual-embedding-002` (Vertex AI)
- Purpose: Semantic similarity matching
- Languages: 18 languages supported
- Storage: Qdrant named vector `text-dense`

Sparse Vectors (variable dimension)
- Model: `prithivida/Splade_PP_en_v1` (SPLADE)
- Purpose: Keyword and entity matching
- Features: Interpretable, exact matching
- Storage: Qdrant named vector `text-sparse`

Reciprocal Rank Fusion (RRF)
- Combines dense + sparse search results
- Balanced ranking: semantic + keyword relevance
- Uses Qdrant's built-in fusion query support
- Well suited to hybrid search scenarios
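Qdrant performs this fusion server-side, but the idea reduces to a few lines: each document's score is the sum of `1 / (k + rank)` over every ranked list it appears in, with `k = 60` the conventional constant. A minimal sketch:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]    # semantic ranking
sparse = ["d3", "d1", "d4"]   # keyword ranking
fused = rrf([dense, sparse])  # "d1" wins: ranked highly in both lists
```

Documents that appear near the top of both lists outrank documents that dominate only one, which is why RRF balances semantic and keyword relevance without score normalization.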
Each processing agent operates autonomously with:
- Decision Logic: Intelligent strategy selection
- Quality Scoring: Confidence metrics for outputs
- Metadata Tracking: Complete processing provenance
- Error Handling: Graceful failure recovery
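The per-agent provenance record this implies might look like the following sketch; the field names are illustrative, not the pipeline's actual metadata schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentProvenance:
    """Illustrative processing record an agent attaches to its output."""
    agent: str                   # e.g. "semantic_chunking"
    strategy: str                # decision the agent made
    confidence: float            # quality score in [0.0, 1.0]
    duration_s: float = 0.0      # wall-clock time spent
    error: Optional[str] = None  # set on graceful failure

def overall_confidence(records):
    """Pipeline-level confidence: the weakest agent bounds the result."""
    return min(r.confidence for r in records) if records else 0.0
```

Taking the minimum (rather than the mean) is one reasonable aggregation: a low-confidence extraction step should cap the trust placed in everything downstream.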
```cypher
// Document -> Page -> Chunk -> Entity hierarchy
(:Document)-[:HAS_PAGE]->(:Page)-[:CONTAINS_CHUNK]->(:Chunk)
(:Chunk)-[:CONTAINS_ENTITY]->(:Entity)
(:Entity)-[:RELATED_TO]->(:Entity)
```
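A traversal over this schema with the official `neo4j` Python driver might look like the sketch below. The property name `name` and the connection details are assumptions; the query builder is kept separate so it can be inspected without a live database:

```python
def entity_neighbors_query(max_hops: int = 2) -> str:
    """Cypher to find entities related to a named entity within max_hops."""
    return (
        f"MATCH (e:Entity {{name: $name}})-[:RELATED_TO*1..{max_hops}]-(n:Entity) "
        "RETURN DISTINCT n.name AS name"
    )

# Live usage (requires a reachable Neo4j instance and credentials):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(uri, auth=(user, password))
# with driver.session() as session:
#     rows = session.run(entity_neighbors_query(), name="Apple")
#     neighbors = [r["name"] for r in rows]
```

Passing the entity name as a `$name` parameter (rather than interpolating it) is the idiomatic way to avoid Cypher injection and lets Neo4j cache the query plan.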
Phoenix Dashboard
- View at: app.phoenix.arize.com
- Traces: End-to-end request flows
- Metrics: Performance, latency, success rates
- Debugging: Agent decisions and quality scores
Key Metrics
- Processing Time: Per agent and total pipeline
- Confidence Scores: Document analysis, extraction, chunking
- Search Performance: Vector similarity, graph traversal
- Quality Indicators: Chunk quality, extraction confidence
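Given per-agent trace records like those Phoenix collects, these headline numbers reduce to simple aggregation. A sketch using hypothetical span dicts (not Phoenix's actual span schema):

```python
from collections import defaultdict

def summarize_spans(spans):
    """Aggregate call counts, total time, and worst confidence per agent.

    Each span is a dict like
    {"agent": ..., "duration_s": ..., "confidence": ...} -- an
    illustrative shape for this sketch.
    """
    totals = defaultdict(lambda: {"calls": 0, "total_s": 0.0, "min_conf": 1.0})
    for s in spans:
        agg = totals[s["agent"]]
        agg["calls"] += 1
        agg["total_s"] += s["duration_s"]
        agg["min_conf"] = min(agg["min_conf"], s["confidence"])
    return dict(totals)
```

The same per-agent rollup is what the Phoenix dashboard surfaces as latency and quality panels.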
- API Keys: Store in secure configuration files (never commit)
- Service Accounts: Use least-privilege Google Cloud IAM
- Network Security: Cloud services with proper authentication
- Data Privacy: Local processing with cloud storage options
Throughput
- Document Processing: ~1-2 docs/minute (depends on complexity)
- Query Response: ~5-15 seconds (with hybrid search)
- Concurrent Users: Supports multiple chat sessions
Optimization
- Sparse Vectors: In-memory indexing for performance
- Connection Pooling: Efficient database connections
- Caching: Query result caching (optional)
- Batch Processing: Optimized for large document sets
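The optional query-result caching can be as simple as memoizing the retrieval call. A sketch assuming a deterministic retriever (the `_search` placeholder below stands in for the real hybrid search):

```python
from functools import lru_cache

def _search(query, top_k):
    """Placeholder for the real hybrid retriever."""
    return [f"hit-{i} for {query}" for i in range(top_k)]

@lru_cache(maxsize=256)
def cached_search(query: str, top_k: int = 5):
    """Memoize hybrid-search results for repeated queries.

    Results are returned as a tuple so the cached value is immutable
    and safe to share across chat sessions.
    """
    return tuple(_search(query, top_k))
```

This is only appropriate while the index is static; after re-ingesting documents, call `cached_search.cache_clear()` so stale results are dropped.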
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes and test thoroughly
4. Commit: `git commit -m 'Add amazing feature'`
5. Push: `git push origin feature/amazing-feature`
6. Open a Pull Request with a detailed description
Development Guidelines
- Follow existing code style and patterns
- Add tests for new functionality
- Update documentation for API changes
- Test with real documents before submitting
This project is licensed under the MIT License - see the LICENSE file for details.
Issues & Questions
- GitHub Issues: Report bugs or request features
- Documentation: Check the `docs/` folder for detailed guides

Common Issues
- API Keys: Verify all service credentials in `config/config.json`
- Dependencies: Ensure the conda environment is activated
- Database Connection: Check Qdrant/Neo4j service status
- Memory: Large documents may require substantial RAM
Star this repo if you find it useful! Your support helps improve the project.