End-to-end data engineering pipeline demonstrating medallion architecture with Databricks and Delta Lake.

amitesh0109/databricks-de-project

E-commerce Data Pipeline

End-to-end data engineering pipeline demonstrating medallion architecture with Databricks and Delta Lake.

🏗️ Architecture

CSV Data → Bronze Layer → Silver Layer → Gold Layer → Dashboard
(Raw)      (Ingested)     (Cleaned)      (Analytics)  (Business)
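The flow above can be sketched in plain Python, with no Spark or Delta Lake involved. This is only an illustration of what each layer is responsible for: the sample rows, the column names (`order_id`, `category`, `amount`), and the cleaning rules are assumptions, not the notebooks' actual logic.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw CSV; the real pipeline reads files produced by 01_data_generation.
RAW_CSV = """order_id,category,amount
1,Electronics,120.50
2,Books,
2,Books,
3,electronics,79.50
4,Books,15.00
"""


def bronze(raw_text):
    """Bronze: ingest rows as-is, keeping every value as a string."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def silver(rows):
    """Silver: drop rows with missing amounts, deduplicate, normalize types."""
    seen, cleaned = set(), []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "category": row["category"].title(),
            "amount": float(row["amount"]),
        })
    return cleaned


def gold(rows):
    """Gold: aggregate revenue per category for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["category"]] += row["amount"]
    return dict(totals)


bronze_rows = bronze(RAW_CSV)
silver_rows = silver(bronze_rows)
gold_metrics = gold(silver_rows)
print(len(bronze_rows), len(silver_rows), gold_metrics)
```

Record counts shrink from layer to layer: Bronze keeps everything, Silver discards invalid rows, and Gold collapses rows into a few aggregates.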

📊 Dataset

  • 500 customers with segments and demographics
  • 100 products across 5 categories
  • 1,000 orders with realistic business patterns
  • 2,000+ order items with pricing and discounts
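A dataset of this shape could be generated along the following lines. This is a minimal sketch using only the standard library, not the code in `01_data_generation`: the field names, the price range, the discount values, and the category placeholders are all assumptions, and the segment labels simply reuse the tier names listed under Business Value.

```python
import random

random.seed(42)  # fixed seed so reruns produce the same dataset

SEGMENTS = ["VIP", "Premium", "Standard", "Basic"]   # assumed segment labels
CATEGORIES = [f"category_{i}" for i in range(1, 6)]  # placeholder category names

customers = [
    {"customer_id": i, "segment": random.choice(SEGMENTS)}
    for i in range(1, 501)
]

products = [
    {
        "product_id": i,
        "category": CATEGORIES[(i - 1) % 5],  # spread products evenly over 5 categories
        "price": round(random.uniform(5, 500), 2),
    }
    for i in range(1, 101)
]

orders = [
    {"order_id": i, "customer_id": random.randint(1, 500)}
    for i in range(1, 1001)
]

order_items = []
for order in orders:
    # 2 to 4 items per order guarantees at least 2,000 order items overall.
    for _ in range(random.randint(2, 4)):
        product = random.choice(products)
        order_items.append({
            "order_id": order["order_id"],
            "product_id": product["product_id"],
            "unit_price": product["price"],
            "discount": random.choice([0.0, 0.05, 0.10]),
        })

print(len(customers), len(products), len(orders), len(order_items))
```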

🚀 Quick Start

Prerequisites

  • Databricks workspace
  • Cluster running Databricks Runtime (DBR) 12.0 or later

Execution Order

# Run notebooks in sequence, one %run per cell (%run must be alone in its cell):
%run ./notebooks/01_data_generation      # ~2 min
%run ./notebooks/02_bronze_ingestion     # ~3 min
%run ./notebooks/03_silver_processing    # ~2 min
%run ./notebooks/04_gold_analytics       # ~2 min
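As an alternative to running the cells by hand, the same sequence could be driven from a single notebook. The sketch below is an assumption about how one might do that, not part of this repository: `run_pipeline` and its parameters are hypothetical names, and the runner is passed in so the function can be exercised outside Databricks.

```python
import time

NOTEBOOKS = [
    "./notebooks/01_data_generation",
    "./notebooks/02_bronze_ingestion",
    "./notebooks/03_silver_processing",
    "./notebooks/04_gold_analytics",
]


def run_pipeline(run_notebook, notebooks=NOTEBOOKS, timeout_seconds=600):
    """Run each notebook in order and return (path, elapsed_seconds) pairs.

    run_notebook is any callable taking (path, timeout_seconds). An exception
    raised by it propagates, so later notebooks are not started after a failure.
    """
    timings = []
    for path in notebooks:
        started = time.monotonic()
        run_notebook(path, timeout_seconds)
        timings.append((path, time.monotonic() - started))
    return timings


# In a Databricks notebook, pass the built-in runner:
#   run_pipeline(dbutils.notebook.run)
```

Note that `dbutils.notebook.run` starts each notebook as a separate run, so variables defined in one notebook are not visible in the next, whereas `%run` executes the notebook inline in the caller's scope. This only works if each stage hands off through tables rather than in-memory variables.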

View Results

-- Open Databricks SQL:
%run ./notebooks/05_dashboard_queries

📁 Files

ecommerce-pipeline/
├── notebooks/
│   ├── 01_data_generation.py       # Generate sample data
│   ├── 02_bronze_ingestion.py      # Raw data ingestion
│   ├── 03_silver_processing.py     # Data cleaning
│   ├── 04_gold_analytics.py        # Business metrics
│   └── 05_dashboard_queries.sql    # Dashboard queries
└── README.md

💼 Business Value

  • Customer Segmentation: VIP, Premium, Standard, Basic tiers
  • Revenue Analytics: Monthly trends and performance
  • Product Performance: Category-wise sales analysis
  • Executive KPIs: Real-time business metrics
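As an illustration of the customer segmentation item, tiers could be assigned from each customer's total spend. The thresholds below are hypothetical, as is the choice of total spend as the segmentation criterion; the README does not state how the Gold layer derives the tiers.

```python
# Hypothetical minimum total spend per tier, ordered from highest to lowest.
TIER_THRESHOLDS = [("VIP", 5000.0), ("Premium", 2000.0), ("Standard", 500.0)]


def assign_tier(total_spend):
    """Return the first tier whose minimum spend is met, else 'Basic'."""
    for tier, minimum in TIER_THRESHOLDS:
        if total_spend >= minimum:
            return tier
    return "Basic"


def segment_customers(orders):
    """Sum spend per customer from (customer_id, amount) pairs, then assign tiers."""
    totals = {}
    for customer_id, amount in orders:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return {customer_id: assign_tier(total) for customer_id, total in totals.items()}


sample_orders = [(1, 3000.0), (1, 2500.0), (2, 100.0), (3, 2000.0)]
print(segment_customers(sample_orders))
```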

🛠️ Technical Skills

  • PySpark: Data transformations and processing
  • Delta Lake: ACID transactions and optimization
  • Medallion Architecture: Bronze/Silver/Gold layers
  • Performance Tuning: Partitioning and Z-ordering
  • Data Quality: Validation and monitoring
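To make the data quality item concrete, here is a minimal sketch of rule-based validation over order rows. The three rules and the field names (`order_id`, `customer_id`, `amount`) are assumptions for illustration; the actual checks in the notebooks may differ.

```python
def validate_orders(rows):
    """Count rule violations in a list of order dicts and report pass/fail."""
    report = {
        "total": len(rows),
        "null_customer": 0,
        "negative_amount": 0,
        "duplicate_order_id": 0,
    }
    seen_ids = set()
    for row in rows:
        if row.get("customer_id") is None:
            report["null_customer"] += 1
        amount = row.get("amount")
        if amount is not None and amount < 0:
            report["negative_amount"] += 1
        if row["order_id"] in seen_ids:
            report["duplicate_order_id"] += 1
        seen_ids.add(row["order_id"])
    # The batch passes only if no rule was violated.
    report["passed"] = (
        report["null_customer"] == 0
        and report["negative_amount"] == 0
        and report["duplicate_order_id"] == 0
    )
    return report


sample_rows = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": None, "amount": 40.0},
    {"order_id": 2, "customer_id": 11, "amount": -5.0},
]
print(validate_orders(sample_rows))
```

A report like this can be logged after each Silver run, so a sudden jump in any counter shows up as a monitoring signal.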

📈 Results

| Layer  | Records | Purpose            |
|--------|---------|--------------------|
| Bronze | 2,100+  | Raw data ingestion |
| Silver | 1,900+  | Cleaned and joined |
| Gold   | 50+     | Aggregated metrics |

Total Runtime: ~10 minutes
