An online cosmetic store has been experiencing growth in website traffic and customer orders. The business wants to better understand user behavior, optimize marketing efforts, and improve sales performance. Its raw ecommerce event data, such as product views, cart additions, and completed purchases, is collected through a SaaS platform.
To support data-driven decision-making, the company needs a data pipeline that ingests, processes, and transforms this raw event data into clean, structured tables and dashboards. These insights should help answer key questions like:
- What are the most popular products?
- How do users behave before making a purchase?
- What time of day or day of the week has the highest sales volume?
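Once the warehouse is populated, a question like the last one could be answered with a query along these lines. This is only a sketch: `<your-project-id>`, `<your_dataset>`, and the `events` table name are placeholders, and the columns follow the raw Kaggle schema rather than the project's actual dbt models:

```sh
# Placeholders: <your-project-id>, <your_dataset>, and the events table name;
# assumes event_time is loaded as a TIMESTAMP and event_type keeps the raw
# Kaggle values (view, cart, remove_from_cart, purchase)
bq query --use_legacy_sql=false '
  SELECT
    EXTRACT(DAYOFWEEK FROM event_time) AS day_of_week,
    EXTRACT(HOUR FROM event_time) AS hour_of_day,
    COUNT(*) AS purchases,
    ROUND(SUM(price), 2) AS revenue
  FROM `<your-project-id>.<your_dataset>.events`
  WHERE event_type = "purchase"
  GROUP BY day_of_week, hour_of_day
  ORDER BY revenue DESC
  LIMIT 10'
```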
The Kaggle dataset can be found here.

Tech stack:
- IaC: Terraform
- Workflow Orchestration: Python/Airflow
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: BigQuery + dbt
- Data Visualization: Looker Studio
A virtual machine with the following is recommended:
- Python 3.x
- Terraform
- Astro CLI (for Airflow v2.x)
- Docker CLI/Desktop
- Google Service Account
In the project root directory, run `setup.sh` to automatically install Docker, Terraform, and the Astro CLI on your system:

```sh
sudo ./setup.sh
```
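You can verify the installs afterwards:

```sh
docker --version
terraform -version
astro version
```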
**Important**
In your Google Cloud Project, create a Google Service Account with the following roles:
- BigQuery Admin
- Storage Admin
- Viewer
Download your account credentials and place the key file on your system (e.g. `/home/<your-username>/.keys/project_creds.json`).
Then point the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at it:

```sh
export GOOGLE_APPLICATION_CREDENTIALS="/home/<your-username>/.keys/project_creds.json"
```
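If you prefer doing this from the command line, here is a minimal gcloud sketch; the service account name `pipeline-sa` is a placeholder, as is `<your-project-id>`:

```sh
# Create the service account (the name pipeline-sa is a placeholder)
gcloud iam service-accounts create pipeline-sa --project="<your-project-id>"

# Grant the three roles listed above
for role in roles/bigquery.admin roles/storage.admin roles/viewer; do
  gcloud projects add-iam-policy-binding "<your-project-id>" \
    --member="serviceAccount:pipeline-sa@<your-project-id>.iam.gserviceaccount.com" \
    --role="$role"
done

# Download the key to the path used by GOOGLE_APPLICATION_CREDENTIALS
gcloud iam service-accounts keys create "/home/<your-username>/.keys/project_creds.json" \
  --iam-account="pipeline-sa@<your-project-id>.iam.gserviceaccount.com"
```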
- After configuring your service account and Terraform, navigate to `./terraform` and modify the `variables.tf` config according to your project details.
- Apply your settings to create the necessary BigQuery dataset and GCS bucket. When setting up for the first time, initialize Terraform:

  ```sh
  terraform init
  ```

  Then apply:

  ```sh
  terraform apply
  ```
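  Terraform also accepts values on the command line; the variable names below are hypothetical, so check `variables.tf` for the actual ones:

  ```sh
  # Hypothetical variable names; check variables.tf for the real ones
  terraform apply -var="project=<your-project-id>" -var="region=<your-region>"
  ```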
- In `./airflow`, modify `docker-compose.override.yml` and `Dockerfile` to properly mount your service account credentials into the Airflow container (a sketch of the override follows).
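  A minimal sketch of what the override might contain, assuming Astro's default `scheduler` service name (the host and container paths are placeholders; adjust both to your layout):

  ```yaml
  # docker-compose.override.yml; service name assumes Astro's default compose setup
  version: "3.1"
  services:
    scheduler:
      volumes:
        - /home/<your-username>/.keys/project_creds.json:/usr/local/airflow/keys/project_creds.json:ro
  ```

  In the `Dockerfile`, you would then point `GOOGLE_APPLICATION_CREDENTIALS` at the container-side path.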
- Since this project uses Kaggle's API, create a new Kaggle API token (`kaggle.json`) and place it in `./airflow/include/`.
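  For example, if the token was downloaded to `~/Downloads` (a placeholder path):

  ```sh
  mkdir -p airflow/include
  cp ~/Downloads/kaggle.json airflow/include/
  chmod 600 airflow/include/kaggle.json  # keep the API token private
  ```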
Start the orchestration using Astronomer:

```sh
sudo astro dev start
```
- Check the pipeline status on the Airflow UI at `localhost:8080` (username and password are both `admin`).
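  The same check can be done from the terminal with the Astro CLI (the `dags list` subcommand assumes Airflow 2's CLI):

  ```sh
  astro dev ps              # show the status of the local Airflow containers
  astro dev run dags list   # run an Airflow CLI command inside the environment
  ```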
You can view the live dashboard here.
Planned improvements:
- Makefile
- More detailed schema
- Better metrics for the dashboard