Airflow Dataset-Aware Pipeline with Failsafe Monitoring

This project demonstrates a robust Apache Airflow pipeline architecture that combines dataset-aware scheduling with failsafe monitoring to ensure reliable data processing workflows.

Purpose

The system implements a data pipeline where:

  • 3 Producer DAGs generate data and update their respective datasets
  • 1 Coordinator DAG triggers the main processing pipeline when all datasets are ready
  • 1 Health Check DAG monitors the main pipeline and provides failsafe triggering via Airflow API

This architecture ensures that your main processing pipeline runs reliably, even if the dataset-aware triggering mechanism fails.
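
All of this hinges on a shared set of Dataset definitions (see include/datasets.py under Components below), which both the producers and the coordinator import. A minimal sketch of what that module might contain; the URIs here are illustrative assumptions, and any stable, unique string works:

```python
# include/datasets.py (sketch)
# Centralized Dataset definitions shared by producer and consumer DAGs.
from airflow.datasets import Dataset

# Illustrative URIs -- replace with whatever identifies your real sources.
DATASET_A = Dataset("s3://data-lake/source_a/output")
DATASET_B = Dataset("s3://data-lake/source_b/output")
DATASET_C = Dataset("s3://data-lake/source_c/output")

ALL_DATASETS = [DATASET_A, DATASET_B, DATASET_C]
```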

Components

  1. Producer DAGs (producer_for_dataset_a/b/c_dag.py)

    • Run on an hourly schedule (0 * * * *)
    • Generate data for their respective sources
    • Update the corresponding dataset upon successful completion (see the sketch after this list)
  2. Dataset Definitions (include/datasets.py)

    • Centralized dataset URIs for consistency
    • Used by both producers and consumers
  3. Coordinator DAG (pipeline_trigger_on_datasets_ready_dag.py)

    • Triggered when ALL three datasets are updated
    • Launches the main processing pipeline
    • Uses dataset-aware scheduling (see the sketch after this list)
  4. Main Processing DAG (main_processing_dag.py)

    • Contains core business logic
    • Triggered by coordinator or failsafe mechanism
    • No time-based schedule of its own; runs only when triggered externally
  5. Health Check DAG (health_check_and_trigger_main_pipeline_dag.py)

    • Monitors the main pipeline's success status via the Airflow API
    • Configurable time thresholds (minutes/hours/days)
    • Triggers the main pipeline if no recent successful run is detected
    • Provides a failsafe mechanism (see the sketch after this list)
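
The sketches below illustrate how the pieces fit together. For the producer DAGs (item 1), the essential detail is declaring the dataset as an outlet of the task, so that a successful run marks the dataset as updated. A minimal sketch for source A, assuming the datasets module above and a placeholder generation task:

```python
# producer_for_dataset_a_dag.py (sketch)
from datetime import datetime

from airflow.decorators import dag, task

from include.datasets import DATASET_A


@dag(
    dag_id="producer_for_dataset_a",
    schedule="0 * * * *",        # hourly, as in the project
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def producer_for_dataset_a():
    @task(outlets=[DATASET_A])   # a successful run marks DATASET_A as updated
    def generate_data():
        # Placeholder for the real data generation logic.
        print("producing data for source A")

    generate_data()


producer_for_dataset_a()
```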
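
For the coordinator DAG (item 3), passing the list of Datasets as the schedule means a run starts only once all three datasets have been updated; the run then hands off to the main pipeline. A minimal sketch, where the main pipeline's dag_id is an assumption:

```python
# pipeline_trigger_on_datasets_ready_dag.py (sketch)
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

from include.datasets import DATASET_A, DATASET_B, DATASET_C


@dag(
    dag_id="pipeline_trigger_on_datasets_ready",
    schedule=[DATASET_A, DATASET_B, DATASET_C],  # runs only after ALL three are updated
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def coordinator():
    TriggerDagRunOperator(
        task_id="trigger_main_processing",
        trigger_dag_id="main_processing",  # assumed dag_id of main_processing_dag.py
    )


coordinator()
```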
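
For the health check DAG (item 5), the failsafe logic boils down to: ask the Airflow REST API for the most recent successful run of the main pipeline, and trigger a new run if none falls within the configured threshold. A minimal sketch; the API URL, credentials, dag_id, threshold, and check interval are all assumptions:

```python
# health_check_and_trigger_main_pipeline_dag.py (sketch)
from datetime import datetime, timedelta, timezone

import requests

from airflow.decorators import dag, task

AIRFLOW_API = "http://localhost:8080/api/v1"   # assumed webserver URL
AUTH = ("admin", "admin")                      # assumed credentials
MAIN_DAG_ID = "main_processing"                # assumed main pipeline dag_id
THRESHOLD = timedelta(hours=2)                 # configurable freshness threshold


@dag(
    dag_id="health_check_and_trigger_main_pipeline",
    schedule="*/30 * * * *",                   # assumed check interval
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def health_check():
    @task
    def trigger_if_stale():
        # Ask the REST API for the most recent successful run of the main DAG.
        resp = requests.get(
            f"{AIRFLOW_API}/dags/{MAIN_DAG_ID}/dagRuns",
            params={"state": "success", "order_by": "-execution_date", "limit": 1},
            auth=AUTH,
        )
        resp.raise_for_status()
        runs = resp.json().get("dag_runs", [])

        now = datetime.now(timezone.utc)
        if runs:
            last_success = datetime.fromisoformat(
                runs[0]["execution_date"].replace("Z", "+00:00")
            )
            if now - last_success <= THRESHOLD:
                print("Main pipeline healthy; no action needed.")
                return

        # No recent success: trigger the main pipeline via the API as a failsafe.
        requests.post(
            f"{AIRFLOW_API}/dags/{MAIN_DAG_ID}/dagRuns",
            json={"logical_date": now.isoformat()},
            auth=AUTH,
        ).raise_for_status()
        print("No recent successful run found; main pipeline triggered.")

    trigger_if_stale()


health_check()
```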

Screenshots

DAGs Overview

Health Check Details

About

Data-aware Airflow scheduling for versioned datasets, with API-based fallback execution.
