End-to-end data engineering pipeline demonstrating the medallion architecture with Databricks and Delta Lake.
```
CSV Data → Bronze Layer → Silver Layer → Gold Layer → Dashboard
  (Raw)     (Ingested)     (Cleaned)    (Analytics)   (Business)
```
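In PySpark, each hop looks roughly like the sketch below. This is a minimal illustration of the layer-to-layer pattern, not the project's actual code: the paths, table names, and columns (`order_id`, `order_date`, `total_amount`) are assumptions.

```python
from pyspark.sql import functions as F  # `spark` is the session Databricks provides in every notebook

# Bronze: land the raw CSV as-is, plus load metadata
bronze = (spark.read.option("header", True).csv("/data/raw/orders.csv")
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("overwrite").saveAsTable("bronze_orders")

# Silver: deduplicate, fix types, drop bad rows
silver = (spark.table("bronze_orders")
          .dropDuplicates(["order_id"])
          .withColumn("order_date", F.to_date("order_date"))
          .filter(F.col("order_id").isNotNull()))
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: aggregate into analytics-ready metrics
gold = (silver.groupBy(F.date_trunc("month", "order_date").alias("month"))
        .agg(F.sum("total_amount").alias("monthly_revenue")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_monthly_revenue")
```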
- 500 customers with segments and demographics
- 100 products across 5 categories
- 1,000 orders with realistic business patterns
- 2,000+ order items with pricing and discounts
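As a rough illustration of how a dataset like this can be produced (the actual `01_data_generation` notebook may do it differently; all names and value ranges below are assumptions):

```python
import random
from datetime import date, timedelta

random.seed(42)
segments = ["VIP", "Premium", "Standard", "Basic"]

customers = [{"customer_id": i, "segment": random.choice(segments)}
             for i in range(1, 501)]   # 500 customers

orders = [{"order_id": i,
           "customer_id": random.randint(1, 500),
           "order_date": str(date(2023, 1, 1) + timedelta(days=random.randint(0, 364))),
           "total_amount": round(random.uniform(10, 500), 2)}
          for i in range(1, 1001)]     # 1,000 orders

orders_df = spark.createDataFrame(orders)  # later written out as CSV for bronze ingestion
```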
- Databricks workspace
- Cluster running Databricks Runtime (DBR) 12.0+
```
# Run notebooks in sequence:
%run ./notebooks/01_data_generation    # ~2 min
%run ./notebooks/02_bronze_ingestion   # ~3 min
%run ./notebooks/03_silver_processing  # ~2 min
%run ./notebooks/04_gold_analytics     # ~2 min

# Then run the dashboard queries (or open them in Databricks SQL):
%run ./notebooks/05_dashboard_queries
```
```
ecommerce-pipeline/
├── notebooks/
│   ├── 01_data_generation.py     # Generate sample data
│   ├── 02_bronze_ingestion.py    # Raw data ingestion
│   ├── 03_silver_processing.py   # Data cleaning
│   ├── 04_gold_analytics.py      # Business metrics
│   └── 05_dashboard_queries.sql  # Dashboard queries
└── README.md
```
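As an example of the kind of work `03_silver_processing.py` performs, here is a hedged sketch of a cleaning-and-join step; the bronze table names and columns (`product_id`, `quantity`, `unit_price`, `discount`) are assumptions, not the notebook's actual schema.

```python
from pyspark.sql import functions as F

items    = spark.table("bronze_order_items")
products = spark.table("bronze_products")

# Enrich line items with product attributes and recompute line revenue
silver_items = (items.join(products, "product_id", "left")
                .withColumn("line_revenue",
                            F.col("quantity") * F.col("unit_price") * (1 - F.col("discount")))
                .filter(F.col("quantity") > 0))  # drop invalid quantities

silver_items.write.format("delta").mode("overwrite").saveAsTable("silver_order_items")
```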
- Customer Segmentation: VIP, Premium, Standard, Basic tiers
- Revenue Analytics: Monthly trends and performance
- Product Performance: Category-wise sales analysis
- Executive KPIs: Real-time business metrics
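A plausible shape for the gold-layer aggregation behind the segmentation and KPI views is sketched below; the table and column names are illustrative assumptions.

```python
from pyspark.sql import functions as F

# Per-segment KPIs: customer count, revenue, and average order value
segment_kpis = (spark.table("silver_orders")
                .join(spark.table("silver_customers"), "customer_id")
                .groupBy("segment")
                .agg(F.countDistinct("customer_id").alias("customers"),
                     F.sum("total_amount").alias("revenue"),
                     F.avg("total_amount").alias("avg_order_value")))

segment_kpis.write.format("delta").mode("overwrite").saveAsTable("gold_segment_kpis")
```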
- PySpark: Data transformations and processing
- Delta Lake: ACID transactions and optimization
- Medallion Architecture: Bronze/Silver/Gold layers
- Performance Tuning: Partitioning and Z-ordering
- Data Quality: Validation and monitoring
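For the tuning and quality points above, a minimal sketch, assuming an illustrative `silver_orders` table with an `order_month` column:

```python
# Partition a large table on a low-cardinality column at write time
(spark.table("silver_orders")
 .write.format("delta")
 .partitionBy("order_month")   # assumed column; pick a low-cardinality key
 .mode("overwrite")
 .saveAsTable("silver_orders_partitioned"))

# Compact small files and co-locate rows on a common filter column
spark.sql("OPTIMIZE silver_orders_partitioned ZORDER BY (customer_id)")

# Basic data-quality gate: no null keys should survive the silver layer
null_keys = spark.table("silver_orders_partitioned").filter("order_id IS NULL").count()
assert null_keys == 0, f"{null_keys} rows with null order_id"
```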
| Layer  | Records | Purpose            |
|--------|---------|--------------------|
| Bronze | 2,100+  | Raw data ingestion |
| Silver | 1,900+  | Cleaned and joined |
| Gold   | 50+     | Aggregated metrics |
Total Runtime: ~10 minutes