This project demonstrates a robust Apache Airflow pipeline architecture that combines dataset-aware scheduling with failsafe monitoring to ensure reliable data processing workflows.
The system implements a data pipeline where:
- 3 Producer DAGs generate data and update their respective datasets
- 1 Coordinator DAG triggers the main processing pipeline when all datasets are ready
- 1 Health Check DAG monitors the main pipeline and provides failsafe triggering via Airflow API
This architecture ensures that your main processing pipeline runs reliably, even if the dataset-aware triggering mechanism fails.
- **Producer DAGs** (`producer_for_dataset_a/b/c_dag.py`)
  - Run on an hourly schedule (`0 * * * *`)
  - Generate data for their respective sources
  - Update their corresponding datasets upon successful completion
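
A minimal sketch of one producer, assuming Airflow 2.4+ dataset support. The `DATASET_A` constant (imported from `include/datasets.py`, covered in the next item), its URI, and the task body are illustrative, not the project's actual code:

```python
from datetime import datetime

from airflow.decorators import dag, task

from include.datasets import DATASET_A  # hypothetical constant name


@dag(schedule="0 * * * *", start_date=datetime(2024, 1, 1), catchup=False)
def producer_for_dataset_a_dag():
    @task(outlets=[DATASET_A])
    def generate_data():
        # Produce data for source A; declaring DATASET_A as an outlet records a
        # dataset update when this task finishes successfully.
        ...

    generate_data()


producer_for_dataset_a_dag()
```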
- **Dataset Definitions** (`include/datasets.py`)
  - Centralized dataset URIs for consistency
  - Used by both producers and consumers
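
The file might look roughly like this; the constant names and URIs below are assumptions, not the project's actual values:

```python
from airflow.datasets import Dataset

# One Dataset per source. The URI is an opaque identifier: producers declare it
# as an outlet, and the coordinator consumes it in its schedule.
DATASET_A = Dataset("s3://example-bucket/source_a/output.parquet")
DATASET_B = Dataset("s3://example-bucket/source_b/output.parquet")
DATASET_C = Dataset("s3://example-bucket/source_c/output.parquet")
```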
- **Coordinator DAG** (`pipeline_trigger_on_datasets_ready_dag.py`)
  - Triggered when ALL three datasets have been updated
  - Launches the main processing pipeline
  - Uses dataset-aware scheduling
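
A sketch of the dataset-aware trigger, assuming the hypothetical constants from `include/datasets.py` above and a target DAG id of `main_processing_dag`:

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

from include.datasets import DATASET_A, DATASET_B, DATASET_C  # hypothetical names


@dag(
    # Runs only after ALL three datasets have received an update since the last run.
    schedule=[DATASET_A, DATASET_B, DATASET_C],
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def pipeline_trigger_on_datasets_ready_dag():
    TriggerDagRunOperator(
        task_id="trigger_main_processing",
        trigger_dag_id="main_processing_dag",
        wait_for_completion=False,
    )


pipeline_trigger_on_datasets_ready_dag()
```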
- **Main Processing DAG** (`main_processing_dag.py`)
  - Contains the core business logic
  - Triggered by the coordinator or by the failsafe mechanism
  - Has no schedule of its own (triggered externally only)
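
Because it is only ever started by the coordinator or the failsafe, its shell might simply declare `schedule=None`; the task name here is illustrative:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def main_processing_dag():
    @task
    def process():
        # Core business logic goes here.
        ...

    process()


main_processing_dag()
```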
- **Health Check DAG** (`health_check_and_trigger_main_pipeline_dag.py`)
  - Monitors the main pipeline's success status via the Airflow API
  - Supports configurable time thresholds (minutes/hours/days)
  - Triggers the main pipeline if no recent successful run is detected
  - Provides the failsafe mechanism
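
A hedged sketch of how such a failsafe could work, assuming the stable REST API (`/api/v1/dags/{dag_id}/dagRuns`) is reachable and basic auth is enabled; the endpoint URL, credentials, check interval, and threshold value are all placeholders, not the project's configuration:

```python
from datetime import datetime, timedelta, timezone

import requests

from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

AIRFLOW_API_URL = "http://localhost:8080/api/v1"  # placeholder webserver URL
MAIN_DAG_ID = "main_processing_dag"
SUCCESS_THRESHOLD = timedelta(hours=2)  # "configurable time threshold" stand-in


@dag(schedule="*/30 * * * *", start_date=datetime(2024, 1, 1), catchup=False)
def health_check_and_trigger_main_pipeline_dag():

    @task.short_circuit
    def no_recent_success() -> bool:
        """Continue downstream only if no successful run exists inside the threshold."""
        cutoff = datetime.now(timezone.utc) - SUCCESS_THRESHOLD
        resp = requests.get(
            f"{AIRFLOW_API_URL}/dags/{MAIN_DAG_ID}/dagRuns",
            params={
                "state": "success",
                "execution_date_gte": cutoff.isoformat(),
                "limit": 1,
            },
            auth=("admin", "admin"),  # replace with your real auth mechanism
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["total_entries"] == 0

    trigger = TriggerDagRunOperator(
        task_id="failsafe_trigger_main_pipeline",
        trigger_dag_id=MAIN_DAG_ID,
    )

    no_recent_success() >> trigger


health_check_and_trigger_main_pipeline_dag()
```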