An AI-powered document processing and conversational search system that transforms multilingual documents into intelligent, searchable knowledge bases using autonomous agents, hybrid vector search, and large language models.
- Interactive Q&A powered by Gemini 2.5 Flash with context-aware responses
- Hybrid Search combining dense (Vertex AI) + sparse (SPLADE) embeddings
- Smart Retrieval with Reciprocal Rank Fusion (RRF) for optimal results
- Metadata Intelligence leveraging ETL processing provenance for result ranking
- Document Analysis Agent: Intelligent content analysis and tool selection
- Layout-Aware Block Extractor: Multi-tool extraction (PyMuPDF, Camelot, OCR)
- Semantic Chunking Agent: Content-aware chunking with quality scoring
- Embedding Agent: Hybrid vector generation (dense + sparse)
- Indexing Agent: Dual storage (Qdrant vector DB + Neo4j graph DB)
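The five-stage flow above can be sketched as a sequential, state-passing pipeline. This is a pure-Python illustration with placeholder stage bodies, not the real LangGraph implementation in `src/doc_pipeline/ingestion_graph/`:

```python
# Minimal sketch of the 5-agent ETL flow as sequential state transforms.
# Stage names mirror the agents above; the bodies are placeholders.

def analyze(state):   # Document Analysis Agent: pick an extraction strategy
    state["strategy"] = "layout_aware"
    return state

def extract(state):   # Layout-Aware Block Extractor: PDF -> blocks
    state["blocks"] = [f"block from {state['pdf']}"]
    return state

def chunk(state):     # Semantic Chunking Agent: blocks -> scored chunks
    state["chunks"] = [{"text": b, "quality": 0.9} for b in state["blocks"]]
    return state

def embed(state):     # Embedding Agent: chunks -> dense + sparse vectors
    state["vectors"] = [{"dense": [0.0] * 3, "sparse": {7: 0.5}}
                        for _ in state["chunks"]]
    return state

def index(state):     # Indexing Agent: write vectors to Qdrant + Neo4j
    state["indexed"] = len(state["vectors"])
    return state

PIPELINE = [analyze, extract, chunk, embed, index]

def run_pipeline(pdf_path):
    state = {"pdf": pdf_path}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

In the real pipeline each stage is a LangGraph node and the dict is a typed state object; the shape of the flow is the same.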
- Dense Vectors: 768D multilingual embeddings via Vertex AI
- Sparse Vectors: SPLADE-based keyword matching
- Vector Database: Qdrant Cloud with named vectors
- Knowledge Graph: Neo4j Aura with entity relationships
- Fusion: RRF combining semantic + keyword relevance
- 18 Languages: Supported by text-multilingual-embedding-002
- OCR Integration: Tesseract for text extraction from images
- Document AI: Google Document AI for complex layouts
- Table Extraction: Camelot for precise table processing
- Phoenix Tracing: End-to-end pipeline monitoring
- Agent Metadata: Complete processing provenance
- Performance Metrics: Response times, confidence scores, quality indicators
- OpenTelemetry: Industry-standard instrumentation
```mermaid
graph TB
    %% Input Layer
    PDF[PDF Documents] --> DIA[Document Analysis Agent]

    %% Processing Pipeline
    DIA --> |Strategy Decision| LABE[Layout-Aware Block Extractor]
    LABE --> |Blocks| SCA[Semantic Chunking Agent]
    SCA --> |Chunks| EA[Embedding Agent]
    EA --> |Vectors| IA[Indexing Agent]

    %% Storage Layer
    IA --> QD[Qdrant Vector DB<br/>Dense + Sparse Vectors]
    IA --> NEO[Neo4j Graph DB<br/>Entities + Relations]

    %% Query Interface
    USER[User Query] --> CA[Chat Agent]
    CA --> |Embed Query| EA2[Query Embedder]
    EA2 --> |Dense + Sparse| HR[Hybrid Retriever]

    %% Retrieval Layer
    HR --> |Vector Search| QD
    HR --> |Graph Traversal| NEO
    HR --> |RRF Fusion| RR[Ranked Results]
    RR --> CA
    CA --> |Response| USER

    %% External Services
    GEMINI[Gemini 2.5 Flash] --> CA
    VERTEX[Vertex AI Embeddings] --> EA
    VERTEX --> EA2
    SPLADE[SPLADE Sparse] --> EA
    SPLADE --> EA2
    PHOENIX[Phoenix Observability] --> ALL[All Components]

    %% Agent Details
    subgraph "5-Agent ETL Pipeline"
        DIA
        LABE
        SCA
        EA
        IA
    end

    subgraph "Hybrid Storage"
        QD
        NEO
    end

    subgraph "Intelligent Retrieval"
        HR
        RR
    end

    style DIA fill:#e1f5fe
    style LABE fill:#e8f5e8
    style SCA fill:#fff3e0
    style EA fill:#f3e5f5
    style IA fill:#fce4ec
    style QD fill:#e0f2f1
    style NEO fill:#e8eaf6
    style CA fill:#fff8e1
    style HR fill:#f1f8e9
```
```bash
# Clone the repository
git clone https://github.com/your-username/Document_RAG.git
cd Document_RAG

# Create the conda environment
conda env create -f environment.yml
conda activate docparser-env

# Install dependencies
pip install -r requirements.txt
```

```bash
# Copy the configuration template
cp config/config_sample.json config/config.json

# Configure your services in config/config.json:
# - Google Cloud credentials and project ID
# - Qdrant Cloud URL and API key
# - Neo4j Aura connection details
# - Phoenix API key for observability
```
Required Services:
- Google Cloud: Vertex AI, Document AI (optional)
- Qdrant Cloud: Vector database with hybrid vectors
- Neo4j Aura: Graph database for entity relationships
- Phoenix: Observability and tracing (optional)
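A quick pre-flight check along these lines can catch missing credentials before the pipeline starts. The key names below are illustrative, not the exact schema of `config_sample.json`:

```python
# Validate that a loaded config dict contains the service settings the
# pipeline needs. The key names here are assumptions for illustration.
REQUIRED_KEYS = {
    "google_cloud_project",
    "qdrant_url", "qdrant_api_key",
    "neo4j_uri", "neo4j_user", "neo4j_password",
}

def missing_keys(config: dict) -> set:
    """Return the required keys absent from the config."""
    return REQUIRED_KEYS - config.keys()
```

Load `config/config.json` with `json.load`, pass the dict in, and fail fast if the returned set is non-empty.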
```bash
# Process a folder of PDFs (recommended for a first run)
python scripts/process_folder_langgraph.py --folder SampleDataSet/SampleDataSet/

# Process an individual document
python scripts/process_sample_dataset.py

# Reset databases (if needed)
python scripts/reset_databases.py
```

```bash
# Start the AI chat interface
python scripts/chat_demo.py

# Example queries:
# "What revenue was mentioned in Q3 2022?"
# "Show me tables with high processing confidence"
# "How are entities connected in the knowledge graph?"
```
```text
Document_RAG/
├── README.md                        # This file
├── config/
│   ├── config_sample.json           # Configuration template
│   └── *.json                       # Service credentials
├── requirements.txt                 # Python dependencies
├── environment.yml                  # Conda environment
├── docker-compose.yml               # Local development services
├── SampleDataSet/                   # Test documents (PDFs)
├── scripts/                         # Execution scripts
│   ├── chat_demo.py                 # Interactive chat interface
│   ├── process_folder_langgraph.py  # Batch document processing
│   ├── process_sample_dataset.py    # Single document processing
│   └── reset_databases.py           # Database cleanup
├── tests/                           # Test suites
│   ├── integration/                 # End-to-end tests
│   └── unit/                        # Component tests
├── docs/                            # Documentation
│   ├── ARCHITECTURE_MIGRATION.md    # Technical architecture
│   └── AUTOGEN_CHAT_AGENT.md        # Chat agent details
└── src/doc_pipeline/                # Core pipeline code
    ├── chat_agent/                  # Conversational interface
    │   ├── agents.py                # AutoGen chat agents
    │   ├── retrieval_tools.py       # Hybrid search implementation
    │   └── query_processor.py       # Query analysis
    ├── ingestion_graph/             # LangGraph ETL pipeline
    │   ├── agents/                  # Autonomous processing agents
    │   ├── graph.py                 # Pipeline orchestration
    │   ├── nodes.py                 # Processing functions
    │   └── state.py                 # Pipeline state management
    ├── docparser/                   # Document parsing tools
    ├── chunking/                    # Text chunking strategies
    ├── embeddings/                  # Vector generation
    ├── graphdb/                     # Neo4j integration
    └── observability/               # Phoenix tracing
```
```bash
# Process all PDFs in a folder with the full pipeline
python scripts/process_folder_langgraph.py --folder /path/to/pdfs

# Process with a custom limit
python scripts/process_folder_langgraph.py --folder /path/to/pdfs --limit 10

# View processing statistics
python scripts/process_folder_langgraph.py --folder /path/to/pdfs --verbose
```

```bash
# Start the interactive chat
python scripts/chat_demo.py

# Example queries:
# "What was Apple's revenue in Q3 2023?"
# "Show me energy certificates from Germany"
# "Find invoices with amounts over $1000"
# "What documents were processed by the Document AI agent?"
```

```bash
# Reset both databases
python scripts/reset_databases.py

# Test cloud connections
python tests/integration/test_cloud_connections.py

# Verify service health
python tests/integration/test_basic_connections.py
```
Dense Vectors (768D)
- Model: `text-multilingual-embedding-002` (Vertex AI)
- Purpose: Semantic similarity matching
- Languages: 18 languages supported
- Storage: Qdrant named vector `text-dense`

Sparse Vectors (variable dimension)
- Model: `prithivida/Splade_PP_en_v1` (SPLADE)
- Purpose: Keyword and entity matching
- Features: Interpretable, exact matching
- Storage: Qdrant named vector `text-sparse`

Reciprocal Rank Fusion (RRF)
- Combines dense + sparse search results
- Balanced ranking: semantic + keyword relevance
- Uses Qdrant's built-in fusion query support
- Well suited to hybrid search scenarios
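Qdrant performs this fusion server-side, but the idea reduces to a few lines: each document's score is the sum of `1 / (k + rank)` over every ranked list it appears in, with `k = 60` the conventional constant. A minimal sketch:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]    # semantic ranking
sparse = ["d3", "d1", "d4"]   # keyword ranking
fused = rrf([dense, sparse])  # "d1" wins: ranked highly in both lists
```

Documents that appear near the top of both lists outrank documents that dominate only one, which is why RRF balances semantic and keyword relevance without score normalization.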
Each processing agent operates autonomously with:
- Decision Logic: Intelligent strategy selection
- Quality Scoring: Confidence metrics for outputs
- Metadata Tracking: Complete processing provenance
- Error Handling: Graceful failure recovery
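The per-agent provenance record this implies might look like the following sketch; the field names are illustrative, not the pipeline's actual metadata schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentProvenance:
    """Illustrative processing record an agent attaches to its output."""
    agent: str                   # e.g. "semantic_chunking"
    strategy: str                # decision the agent made
    confidence: float            # quality score in [0.0, 1.0]
    duration_s: float = 0.0      # wall-clock time spent
    error: Optional[str] = None  # set on graceful failure

def overall_confidence(records):
    """Pipeline-level confidence: the weakest agent bounds the result."""
    return min(r.confidence for r in records) if records else 0.0
```

Taking the minimum (rather than the mean) is one reasonable aggregation: a low-confidence extraction step should cap the trust placed in everything downstream.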
```cypher
// Document -> Page -> Chunk -> Entity hierarchy
(:Document)-[:HAS_PAGE]->(:Page)-[:CONTAINS_CHUNK]->(:Chunk)
(:Chunk)-[:CONTAINS_ENTITY]->(:Entity)
(:Entity)-[:RELATED_TO]->(:Entity)
```
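A traversal over this schema with the official `neo4j` Python driver might look like the sketch below. The property name `name` and the connection details are assumptions; the query builder is kept separate so it can be inspected without a live database:

```python
def entity_neighbors_query(max_hops: int = 2) -> str:
    """Cypher to find entities related to a named entity within max_hops."""
    return (
        f"MATCH (e:Entity {{name: $name}})-[:RELATED_TO*1..{max_hops}]-(n:Entity) "
        "RETURN DISTINCT n.name AS name"
    )

# Live usage (requires a reachable Neo4j instance and credentials):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(uri, auth=(user, password))
# with driver.session() as session:
#     rows = session.run(entity_neighbors_query(), name="Apple")
#     neighbors = [r["name"] for r in rows]
```

Passing the entity name as a `$name` parameter (rather than interpolating it) is the idiomatic way to avoid Cypher injection and lets Neo4j cache the query plan.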
Phoenix Dashboard
- View at: app.phoenix.arize.com
- Traces: End-to-end request flows
- Metrics: Performance, latency, success rates
- Debugging: Agent decisions and quality scores
Key Metrics
- Processing Time: Per agent and total pipeline
- Confidence Scores: Document analysis, extraction, chunking
- Search Performance: Vector similarity, graph traversal
- Quality Indicators: Chunk quality, extraction confidence
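Given per-agent trace records like those Phoenix collects, these headline numbers reduce to simple aggregation. A sketch using hypothetical span dicts (not Phoenix's actual span schema):

```python
from collections import defaultdict

def summarize_spans(spans):
    """Aggregate call counts, total time, and worst confidence per agent.

    Each span is a dict like
    {"agent": ..., "duration_s": ..., "confidence": ...} -- an
    illustrative shape for this sketch.
    """
    totals = defaultdict(lambda: {"calls": 0, "total_s": 0.0, "min_conf": 1.0})
    for s in spans:
        agg = totals[s["agent"]]
        agg["calls"] += 1
        agg["total_s"] += s["duration_s"]
        agg["min_conf"] = min(agg["min_conf"], s["confidence"])
    return dict(totals)
```

The same per-agent rollup is what the Phoenix dashboard surfaces as latency and quality panels.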
- API Keys: Store in secure configuration files (never commit)
- Service Accounts: Use least-privilege Google Cloud IAM
- Network Security: Cloud services with proper authentication
- Data Privacy: Local processing with cloud storage options
Throughput
- Document Processing: ~1-2 docs/minute (depends on complexity)
- Query Response: ~5-15 seconds (with hybrid search)
- Concurrent Users: Supports multiple chat sessions
Optimization
- Sparse Vectors: In-memory indexing for performance
- Connection Pooling: Efficient database connections
- Caching: Query result caching (optional)
- Batch Processing: Optimized for large document sets
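The optional query-result caching can be as simple as memoizing the retrieval call. A sketch assuming a deterministic retriever (the `_search` placeholder below stands in for the real hybrid search):

```python
from functools import lru_cache

def _search(query, top_k):
    """Placeholder for the real hybrid retriever."""
    return [f"hit-{i} for {query}" for i in range(top_k)]

@lru_cache(maxsize=256)
def cached_search(query: str, top_k: int = 5):
    """Memoize hybrid-search results for repeated queries.

    Results are returned as a tuple so the cached value is immutable
    and safe to share across chat sessions.
    """
    return tuple(_search(query, top_k))
```

This is only appropriate while the index is static; after re-ingesting documents, call `cached_search.cache_clear()` so stale results are dropped.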
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make your changes and test thoroughly
4. Commit: `git commit -m 'Add amazing feature'`
5. Push: `git push origin feature/amazing-feature`
6. Open a Pull Request with a detailed description
Development Guidelines
- Follow existing code style and patterns
- Add tests for new functionality
- Update documentation for API changes
- Test with real documents before submitting
This project is licensed under the MIT License - see the LICENSE file for details.
Issues & Questions
- GitHub Issues: Report bugs or request features
- Documentation: Check the `docs/` folder for detailed guides

Common Issues
- API Keys: Verify all service credentials in `config/config.json`
- Dependencies: Ensure the conda environment is activated
- Database Connection: Check Qdrant/Neo4j service status
- Memory: Large documents may require substantial RAM
Star this repo if you find it useful! Your support helps improve the project.