Data Pipeline to Generate Embeddings & RAG-based Document Q&A System using Databricks

Project Overview

This project implements a scalable, production-grade data engineering pipeline using Databricks that enables semantic search and LLM-powered Q&A over a collection of documents (PDFs, text files, etc.).

It uses:

the Medallion architecture (Bronze → Silver → Gold),
Delta Lake for data reliability,
Databricks Vector Search for similarity retrieval, and
Databricks Foundation Model Serving to power Retrieval-Augmented Generation (RAG).

Tech Stack

Component	Tools Used
Language	Python, PySpark, SQL
Data Lakehouse	Databricks, Delta Lake, Unity Catalog
Ingestion	Auto Loader, DLT
Transformation	Spark UDFs, chunking, metadata extraction
Vector Search	Databricks Delta Sync Index (Vector Search)
Embedding Model	`e5-small-v2`
LLM for RAG	`databricks-llama-4-maverick` (via Foundation Model Serving)
API Deployment	Databricks Model Serving
Observability	Model Context Protocol (MCP), logging to Delta Lake

Databricks Catalog Overview

Instructions

Configure the notebooks prefixed 01, 02, and 03 (01_ingest_documents, 02_Tokenize_Data, 03_Generate_Embeddings) as separate tasks in a single job.
Upload a document to the source volume under the source schema.
Run the job to generate the embeddings table, which stores each document as chunks, with each chunk represented by a vector.
Run the Vector Search Endpoint and Index notebook, which creates a vector search endpoint and a vector index on the embeddings column of the table under the gold schema.
Run the Main notebook with your query and user.
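
The embeddings table stores each document as chunks, so the chunking step can be illustrated with a minimal sketch. This is not the code from the 02 or 03 notebooks: the function name `chunk_text`, the word-based splitting, and the default `chunk_size` and `overlap` values are all assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping, word-based chunks.

    Hypothetical helper: the real pipeline may split by tokens,
    sentences, or pages instead of whitespace-separated words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({"chunk_id": len(chunks), "text": " ".join(window)})
        # Stop once the window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping windows mean a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice. In a Spark job, a function like this would typically be wrapped in a UDF and applied per document.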

Medallion Architecture

graph TD;
    A[Ingest PDFs to Bronze] --> B[Parse & Clean to Silver];
    B --> C[Chunk + Metadata + Embeddings to Gold];
    C --> D[Vector Search Index];
    D --> E[Query Relevant Chunks];
    E --> F[Pass to LLM with RAG Prompt];
    F --> G[Return Final Answer via API];
    G --> H[Log via MCP to Delta Lake];
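
The query-time half of the flow (steps D to F) can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: an in-memory cosine-similarity search stands in for the Databricks Vector Search index, a caller-supplied `llm` callable stands in for the Foundation Model Serving endpoint, and the names `top_k_chunks`, `build_rag_prompt`, and `answer_question` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, rows, k=3):
    """Return the k rows whose 'embedding' is most similar to query_vec.

    Stand-in for a vector index lookup; rows are dicts with
    'text' and 'embedding' keys (an assumed schema).
    """
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vec, row["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(question, chunks):
    """Assemble a prompt that asks the model to answer from the retrieved chunks."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk['text']}" for i, chunk in enumerate(chunks)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_question(question, query_vec, rows, llm, k=3):
    """Retrieve relevant chunks, build the prompt, and pass it to the llm callable."""
    chunks = top_k_chunks(query_vec, rows, k=k)
    return llm(build_rag_prompt(question, chunks))
```

Keeping retrieval, prompt construction, and the model call as separate functions makes it straightforward to swap the in-memory search for a real index query and the `llm` callable for a serving endpoint client without touching the prompt logic.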

