kedardg/Document_RAG

🤖 Advanced Document RAG Pipeline with Hybrid AI Chat Agent

A production-ready, AI-powered document processing and conversational search system that turns multilingual documents into searchable knowledge bases using autonomous agents, hybrid vector search, and large language models.

🎯 Key Features

🤖 Intelligent Chat Agent

  • Interactive Q&A powered by Gemini 2.5 Flash with context-aware responses
  • Hybrid Search combining dense (Vertex AI) + sparse (SPLADE) embeddings
  • Smart Retrieval with Reciprocal Rank Fusion (RRF) for optimal results
  • Metadata Intelligence leveraging ETL processing provenance for result ranking

🔄 5-Agent Autonomous ETL Pipeline

  • Document Analysis Agent: Intelligent content analysis and tool selection
  • Layout-Aware Block Extractor: Multi-tool extraction (PyMuPDF, Camelot, OCR)
  • Semantic Chunking Agent: Content-aware chunking with quality scoring
  • Embedding Agent: Hybrid vector generation (dense + sparse)
  • Indexing Agent: Dual storage (Qdrant vector DB + Neo4j graph DB)
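
The hand-off between these five agents can be sketched as a chain of functions over a shared state dict. This is an illustrative toy, not the repository's actual LangGraph code (which lives in src/doc_pipeline/ingestion_graph/); every function name, key, and value below is hypothetical:

```python
# Toy sketch of the 5-agent hand-off: each agent reads and enriches a
# shared state dict, mirroring how a LangGraph pipeline threads state
# through its nodes. All names here are illustrative stand-ins.

def analyze(state):          # Document Analysis Agent: pick a strategy
    state["strategy"] = "pymupdf" if state["doc"].endswith(".pdf") else "ocr"
    return state

def extract(state):          # Layout-Aware Block Extractor
    state["blocks"] = [f"block from {state['doc']} via {state['strategy']}"]
    return state

def chunk(state):            # Semantic Chunking Agent with quality score
    state["chunks"] = [{"text": b, "quality": 0.9} for b in state["blocks"]]
    return state

def embed(state):            # Embedding Agent (dense + sparse, stubbed out)
    state["vectors"] = [{"dense": [0.0] * 768, "sparse": {}} for _ in state["chunks"]]
    return state

def index(state):            # Indexing Agent (Qdrant + Neo4j, stubbed out)
    state["indexed"] = len(state["vectors"])
    return state

def run_pipeline(doc_path):
    state = {"doc": doc_path}
    for agent in (analyze, extract, chunk, embed, index):
        state = agent(state)
    return state

result = run_pipeline("report.pdf")
print(result["indexed"])  # → 1
```

The real pipeline adds error handling, confidence scoring, and conditional routing between agents; the point here is only the shared-state pattern.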

🔍 Hybrid Search Architecture

  • Dense Vectors: 768D multilingual embeddings via Vertex AI
  • Sparse Vectors: SPLADE-based keyword matching
  • Vector Database: Qdrant Cloud with named vectors
  • Knowledge Graph: Neo4j Aura with entity relationships
  • Fusion: RRF combining semantic + keyword relevance

🌍 Multilingual & Multi-Modal

  • 18 Languages: Supported by text-multilingual-embedding-002
  • OCR Integration: Tesseract for text extraction from images
  • Document AI: Google Document AI for complex layouts
  • Table Extraction: Camelot for precise table processing

📊 Enterprise Observability

  • Phoenix Tracing: End-to-end pipeline monitoring
  • Agent Metadata: Complete processing provenance
  • Performance Metrics: Response times, confidence scores, quality indicators
  • OpenTelemetry: Industry-standard instrumentation

πŸ—οΈ System Architecture

graph TB
    %% Input Layer
    PDF[📄 PDF Documents] --> DIA[🧠 Document Analysis Agent]

    %% Processing Pipeline
    DIA --> |Strategy Decision| LABE[📦 Layout-Aware Block Extractor]
    LABE --> |Blocks| SCA[🧱 Semantic Chunking Agent]
    SCA --> |Chunks| EA[🔢 Embedding Agent]
    EA --> |Vectors| IA[💾 Indexing Agent]

    %% Storage Layer
    IA --> QD[🔍 Qdrant Vector DB<br/>Dense + Sparse Vectors]
    IA --> NEO[🕸️ Neo4j Graph DB<br/>Entities + Relations]

    %% Query Interface
    USER[👤 User Query] --> CA[🤖 Chat Agent]
    CA --> |Embed Query| EA2[🔢 Query Embedder]
    EA2 --> |Dense + Sparse| HR[🔄 Hybrid Retriever]

    %% Retrieval Layer
    HR --> |Vector Search| QD
    HR --> |Graph Traversal| NEO
    HR --> |RRF Fusion| RR[📋 Ranked Results]
    RR --> CA
    CA --> |Response| USER

    %% External Services
    GEMINI[🎯 Gemini 2.5 Flash] --> CA
    VERTEX[🧠 Vertex AI Embeddings] --> EA
    VERTEX --> EA2
    SPLADE[🔤 SPLADE Sparse] --> EA
    SPLADE --> EA2
    PHOENIX[📊 Phoenix Observability] --> ALL[All Components]

    %% Agent Details
    subgraph "5-Agent ETL Pipeline"
        DIA
        LABE
        SCA
        EA
        IA
    end

    subgraph "Hybrid Storage"
        QD
        NEO
    end

    subgraph "Intelligent Retrieval"
        HR
        RR
    end

    style DIA fill:#e1f5fe
    style LABE fill:#e8f5e8
    style SCA fill:#fff3e0
    style EA fill:#f3e5f5
    style IA fill:#fce4ec
    style QD fill:#e0f2f1
    style NEO fill:#e8eaf6
    style CA fill:#fff8e1
    style HR fill:#f1f8e9

🚀 Quick Start

1. Environment Setup

# Clone repository
git clone https://github.com/your-username/Document_RAG.git
cd Document_RAG

# Create conda environment
conda env create -f environment.yml
conda activate docparser-env

# Install dependencies
pip install -r requirements.txt

2. Service Configuration

# Copy configuration template
cp config/config_sample.json config/config.json

# Configure your services in config/config.json:
# - Google Cloud credentials and project ID
# - Qdrant Cloud URL and API key  
# - Neo4j Aura connection details
# - Phoenix API key for observability

Required Services:

  • Google Cloud: Vertex AI, Document AI (optional)
  • Qdrant Cloud: Vector database with hybrid vectors
  • Neo4j Aura: Graph database for entity relationships
  • Phoenix: Observability and tracing (optional)
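
As a rough sketch, config/config.json might look like the following. The authoritative key names come from config/config_sample.json in the repo, so treat every field below as a placeholder:

```json
{
  "google_cloud": {
    "project_id": "your-gcp-project",
    "credentials_path": "config/service-account.json"
  },
  "qdrant": {
    "url": "https://your-cluster.qdrant.io",
    "api_key": "YOUR_QDRANT_API_KEY"
  },
  "neo4j": {
    "uri": "neo4j+s://your-instance.databases.neo4j.io",
    "user": "neo4j",
    "password": "YOUR_NEO4J_PASSWORD"
  },
  "phoenix": {
    "api_key": "YOUR_PHOENIX_API_KEY"
  }
}
```

Keep this file out of version control; it holds live credentials.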

3. Document Processing

# Process a folder of PDFs (recommended for first run)
python scripts/process_folder_langgraph.py --folder SampleDataSet/SampleDataSet/

# Process individual document
python scripts/process_sample_dataset.py

# Reset databases (if needed)
python scripts/reset_databases.py

4. Interactive Chat

# Start the AI chat interface
python scripts/chat_demo.py

# Example queries:
# "What revenue was mentioned in Q3 2022?"
# "Show me tables with high processing confidence"
# "How are entities connected in the knowledge graph?"

📁 Project Structure

Document_RAG/
├── 📄 README.md                     # This file
├── ⚙️ config/
│   ├── config_sample.json           # Configuration template
│   └── *.json                       # Service credentials
├── 📦 requirements.txt              # Python dependencies
├── 🐍 environment.yml               # Conda environment
├── 🐳 docker-compose.yml            # Local development services
├── 📊 SampleDataSet/                # Test documents (PDFs)
├── 🔧 scripts/                      # Execution scripts
│   ├── chat_demo.py                 # Interactive chat interface
│   ├── process_folder_langgraph.py  # Batch document processing
│   ├── process_sample_dataset.py    # Single document processing
│   └── reset_databases.py           # Database cleanup
├── 🧪 tests/                        # Test suites
│   ├── integration/                 # End-to-end tests
│   └── unit/                        # Component tests
├── 📚 docs/                         # Documentation
│   ├── ARCHITECTURE_MIGRATION.md    # Technical architecture
│   └── AUTOGEN_CHAT_AGENT.md        # Chat agent details
└── 🔧 src/doc_pipeline/             # Core pipeline code
    ├── chat_agent/                  # Conversational interface
    │   ├── agents.py                # AutoGen chat agents
    │   ├── retrieval_tools.py       # Hybrid search implementation
    │   └── query_processor.py       # Query analysis
    ├── ingestion_graph/             # LangGraph ETL pipeline
    │   ├── agents/                  # Autonomous processing agents
    │   ├── graph.py                 # Pipeline orchestration
    │   ├── nodes.py                 # Processing functions
    │   └── state.py                 # Pipeline state management
    ├── docparser/                   # Document parsing tools
    ├── chunking/                    # Text chunking strategies
    ├── embeddings/                  # Vector generation
    ├── graphdb/                     # Neo4j integration
    └── observability/               # Phoenix tracing

🔧 Usage Examples

Processing Documents

# Process all PDFs in a folder with full pipeline
python scripts/process_folder_langgraph.py --folder /path/to/pdfs

# Process with custom limits
python scripts/process_folder_langgraph.py --folder /path/to/pdfs --limit 10

# View processing statistics
python scripts/process_folder_langgraph.py --folder /path/to/pdfs --verbose

Chat Interface

# Start interactive chat
python scripts/chat_demo.py

# Example queries:
❓ "What was Apple's revenue in Q3 2023?"
❓ "Show me energy certificates from Germany" 
❓ "Find invoices with amounts over $1000"
❓ "What documents were processed by the Document AI agent?"

Database Operations

# Reset both databases
python scripts/reset_databases.py

# Test cloud connections
python tests/integration/test_cloud_connections.py

# Verify service health
python tests/integration/test_basic_connections.py

🧠 Technical Deep Dive

Hybrid Vector Architecture

Dense Vectors (768D)

  • Model: text-multilingual-embedding-002 (Vertex AI)
  • Purpose: Semantic similarity matching
  • Languages: 18 languages supported
  • Storage: Qdrant named vector text-dense

Sparse Vectors (Variable-D)

  • Model: prithivida/Splade_PP_en_v1 (SPLADE)
  • Purpose: Keyword and entity matching
  • Features: Interpretable, exact matching
  • Storage: Qdrant named vector text-sparse

Reciprocal Rank Fusion (RRF)

  • Combines dense + sparse search results
  • Balanced ranking: semantic + keyword relevance
  • Built-in Qdrant fusion query support
  • Optimal for hybrid search scenarios
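
In production the fusion is delegated to Qdrant's built-in RRF query, but the math itself fits in a few lines. The sketch below is a pure-Python illustration (the chunk IDs are made up; k = 60 is the conventional smoothing constant):

```python
def rrf_fuse(result_lists, k=60):
    """Fuse several best-first ranked ID lists via Reciprocal Rank Fusion.

    A document's fused score is sum(1 / (k + rank)) over every list it
    appears in, so items ranked well by both dense and sparse search
    rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["chunk_a", "chunk_b", "chunk_c"]   # semantic ranking
sparse_hits = ["chunk_b", "chunk_d", "chunk_a"]   # keyword ranking
print(rrf_fuse([dense_hits, sparse_hits]))
# → ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

chunk_b wins because it appears near the top of both lists, even though neither list ranks it first; that behavior is exactly why RRF suits hybrid search.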

Agent-Based Processing

Each processing agent operates autonomously with:

  • Decision Logic: Intelligent strategy selection
  • Quality Scoring: Confidence metrics for outputs
  • Metadata Tracking: Complete processing provenance
  • Error Handling: Graceful failure recovery
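
As an illustration, a per-agent provenance record could be modeled as a small dataclass attached to each chunk's payload. The field names below are hypothetical, not the pipeline's real metadata schema:

```python
# Hypothetical provenance record for one agent's work on a document.
# The real schema lives in the pipeline source; this only shows the shape.
from dataclasses import dataclass, field, asdict
import time

@dataclass
class AgentTrace:
    agent: str                   # which agent produced this output
    strategy: str                # e.g. which extraction tool was chosen
    confidence: float            # quality score in [0, 1]
    started_at: float = field(default_factory=time.time)
    notes: dict = field(default_factory=dict)

trace = AgentTrace(agent="layout_block_extractor",
                   strategy="camelot",
                   confidence=0.92,
                   notes={"tables_found": 3})
record = asdict(trace)  # plain dict, ready to store as chunk metadata
```

Storing such a record per chunk is what lets the chat agent later rank or filter results by processing provenance (e.g. "tables with high confidence").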

Knowledge Graph Schema

// Document → Page → Chunk → Entity hierarchy
(:Document)-[:HAS_PAGE]->(:Page)-[:CONTAINS_CHUNK]->(:Chunk)
(:Chunk)-[:CONTAINS_ENTITY]->(:Entity)
(:Entity)-[:RELATED_TO]->(:Entity)

📊 Monitoring & Observability

Phoenix Dashboard

  • View at: app.phoenix.arize.com
  • Traces: End-to-end request flows
  • Metrics: Performance, latency, success rates
  • Debugging: Agent decisions and quality scores

Key Metrics

  • Processing Time: Per agent and total pipeline
  • Confidence Scores: Document analysis, extraction, chunking
  • Search Performance: Vector similarity, graph traversal
  • Quality Indicators: Chunk quality, extraction confidence

🔒 Security & Best Practices

  • API Keys: Store in secure configuration files (never commit)
  • Service Accounts: Use least-privilege Google Cloud IAM
  • Network Security: Cloud services with proper authentication
  • Data Privacy: Local processing with cloud storage options

🚦 Performance & Scalability

Throughput

  • Document Processing: ~1-2 docs/minute (depends on complexity)
  • Query Response: ~5-15 seconds (with hybrid search)
  • Concurrent Users: Supports multiple chat sessions

Optimization

  • Sparse Vectors: In-memory indexing for performance
  • Connection Pooling: Efficient database connections
  • Caching: Query result caching (optional)
  • Batch Processing: Optimized for large document sets

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and test thoroughly
  4. Commit: git commit -m 'Add amazing feature'
  5. Push: git push origin feature/amazing-feature
  6. Open a Pull Request with detailed description

Development Guidelines

  • Follow existing code style and patterns
  • Add tests for new functionality
  • Update documentation for API changes
  • Test with real documents before submitting

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

For bugs and questions, please open a GitHub issue.

Common Issues

  • API Keys: Verify all service credentials in config/config.json
  • Dependencies: Ensure conda environment is activated
  • Database Connection: Check Qdrant/Neo4j service status
  • Memory: Large documents can consume significant RAM; use the --limit flag to process in smaller batches

🌟 Star this repo if you find it useful! Your support helps improve the project.
