
End-to-end MLOps pipeline showcasing senior-level best practices with Hydra for configuration, MLflow for experiment tracking, Optuna for hyperparameter tuning, and DVC for data version control. This repository focuses on reproducibility, modular design, and streamlined collaboration—an ideal demonstration of advanced MLOps capabilities.


kletobias/advanced-mlops-lifecycle-hydra-mlflow-optuna-dvc


Badges: CI · Infrastructure: Terraform · MLflow · Optuna · Hydra · DVC · Prefect


Medical DRG in NY — A Reproducible ML & IaC Pipeline

Config-driven analytics on New York State DRG data with Hydra, DVC, MLflow, Optuna, and Terraform-provisioned AWS infrastructure. The project showcases senior-level MLOps patterns: modular transformations, data lineage, hyper-parameter search, IaC, and CI/CD automation.

Spotlight Series & Deep-Dive Articles

Links to every Spotlight post and deep-dive project note on each MLOps pipeline component—DVC, feature engineering, hyperparameter tuning, logging, MLflow, modular code, Jinja2 templates, and transformations.

Articles

Videos

Note: This repo is primarily a portfolio project. The pipeline is mostly reproducible, but may require a few manual adjustments to run end-to-end on your machine (explained below). If you just want to inspect the pipeline structure and ML artifacts, you can do so without running the entire pipeline locally.


Dataset

Key Features

  • Hydra Configuration – parameters live outside the codebase (see the sketch after this list).
  • Data Versioning with DVC
  • Experiment Tracking with MLflow
  • Modular Transformations
  • Infrastructure-as-Code (Terraform) – S3 buckets, IAM OIDC role, optional GPU runner are provisioned repeatably.
  • CI/CD via GitHub Actions
  • Metadata Logging
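These features converge in a single config-driven entry point (scripts/universal_step.py). The snippet below is only an illustrative sketch of that pattern, not the actual script; the config keys (data_versions.input_path, transformations.transform_fn, data_versions.output_path) are assumptions:

# Illustrative sketch of a Hydra-driven step; the real scripts/universal_step.py differs.
import hydra
import pandas as pd
from omegaconf import DictConfig


@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # All paths and parameters come from the composed Hydra config,
    # so one script can serve every transformation step.
    df = pd.read_csv(cfg.data_versions.input_path)                       # assumed config key
    df_out = hydra.utils.call(cfg.transformations.transform_fn, df=df)   # assumed config key
    df_out.to_csv(cfg.data_versions.output_path, index=False)            # assumed config key


if __name__ == "__main__":
    main()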

Tests & Validation

This project supports data validation through Pandera-based checks.
The universal_step.py script automatically runs tests whenever the relevant test flags (e.g., check_required_columns, check_row_count) are set to True in configs/transformations/.

How it Works

  • Test Definitions: Located in dependencies/tests/check_required_columns.py and dependencies/tests/check_row_count.py.
  • Test Config: Global YAML configs in configs/tests/base.yaml specify the required columns and expected row counts (these can be overridden in configs/test_params/).
  • Automatic Execution: Once a transformation step completes, the script checks if any test flags are True. If so, it calls the corresponding test function with the parameters from cfg.tests (sketched below).
  • Failure Handling: Any mismatch or missing column triggers a Pandera SchemaError, halting the pipeline.
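For orientation, the dispatch can be pictured roughly as follows; this is a sketch, and the imported function names and config-key layout are assumptions based on the module paths above:

# Sketch of the flag-based dispatch; the actual logic lives in scripts/universal_step.py.
import pandas as pd
from omegaconf import DictConfig

from dependencies.tests.check_required_columns import check_required_columns  # assumed name
from dependencies.tests.check_row_count import check_row_count                # assumed name


def run_configured_tests(df: pd.DataFrame, cfg: DictConfig) -> None:
    """Run each test whose flag is enabled, using parameters from cfg.tests."""
    if cfg.transformations.get("check_required_columns", False):
        check_required_columns(df, **cfg.tests.check_required_columns)
    if cfg.transformations.get("check_row_count", False):
        check_row_count(df, **cfg.tests.check_row_count)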

Example
To enable the column and row-count checks, set the following in the respective YAML configs:

# configs/transformations/base.yaml
check_required_columns: true
check_row_count: true
# configs/test_params/base.yaml

# configs/test_params/v0...v13.yaml set version specific values where needed.

check_required_columns:
  required_columns: ["year", "facility_id", "apr_drg_code"]

check_row_count:
  row_count: 1081672

This ensures the pipeline data is consistent and trustworthy throughout each processing stage.
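For reference, a Pandera-backed required-columns check along the lines of dependencies/tests/check_required_columns.py could look like the sketch below; the actual implementation in the repo may differ:

# Sketch of a Pandera column check; the real dependencies/tests/check_required_columns.py may differ.
from typing import Sequence

import pandas as pd
import pandera as pa


def check_required_columns(df: pd.DataFrame, required_columns: Sequence[str]) -> pd.DataFrame:
    """Raise a pandera SchemaError if any required column is missing."""
    schema = pa.DataFrameSchema(
        {col: pa.Column(nullable=True) for col in required_columns},
        strict=False,  # extra columns are fine; missing required ones are not
    )
    return schema.validate(df)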

MLflow Experiments & model_tags

Each experiment uses a dynamically generated run_id (from ${run_id_outputs}) to group logs, artifacts, and results. In both rf_optuna_trial and ridge_optuna_trial, we inject model_tags containing:

  • run_id_tag — a unique timestamped ID generated by Hydra (via ${now:%Y-%m-%d_%H-%M-%S} or ${run_id_outputs}).
  • data_version_tag — the current data version in use (${data_versions.data_version_input}), ensuring visibility of dataset provenance.

How it Works

  • configs/transformations/rf_optuna_trial.yaml and configs/transformations/ridge_optuna_trial.yaml define model_tags to pass into the trial function.
  • MLflow uses these tags in all nested runs—training, final_model, and final_model_predict_test.
  • When each run starts, the tags are logged (e.g., run_id_tag becomes a “tag” in the MLflow UI).
  • Artifacts, metrics, and parameters for that run are associated with the same ID, making it easy to filter or query by run_id_tag later.

Example

# configs/transformations/rf_optuna_trial.yaml

model_tags:
  run_id_tag: ${ml_experiments.mlflow_tags.run_id_tag}
  data_version_tag: ${ml_experiments.mlflow_tags.data_version_tag}
  model_tag: RandomForestRegressor # gets updated automatically for each trial

This ensures each MLflow run is fully traceable to both a particular pipeline execution (via run_id_tag) and the underlying dataset (via data_version_tag). When your experiments scale up, searching and grouping by these tags provides a clear lineage of how each result was produced.
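In code, attaching these tags to the parent run and its nested runs boils down to mlflow.set_tags; the snippet below is a sketch with placeholder values, not the exact trial code from dependencies/:

# Sketch of tag propagation across nested MLflow runs; values are placeholders.
import mlflow

model_tags = {
    "run_id_tag": "2025-01-01_12-00-00",    # would come from ${ml_experiments.mlflow_tags.run_id_tag}
    "data_version_tag": "v10",              # would come from ${ml_experiments.mlflow_tags.data_version_tag}
    "model_tag": "RandomForestRegressor",
}

with mlflow.start_run(run_name="rf_optuna_trial"):
    mlflow.set_tags(model_tags)
    with mlflow.start_run(run_name="training", nested=True):
        mlflow.set_tags(model_tags)          # same tags on every nested run
        mlflow.log_metric("rmse", 123.4)     # placeholder metric value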

Infrastructure (IaC via Terraform)

infra/
├── main.tf              # root module – backend & providers
├── modules/
│   ├── s3_data/         # DVC remote bucket
│   ├── s3_mlflow/       # MLflow artifacts bucket
│   ├── iam_github_oidc/ # federated role for CI
│   └── gpu_runner/      # optional self-hosted GPU runner
└── environments/
    ├── dev.tfvars
    └── prod.tfvars

Bootstrap

cd infra
terraform init
terraform apply -var-file=environments/dev.tfvars

All resources are tagged mlops-ny-drg and can be removed with terraform destroy.

CI jobs assume GithubOIDCRole, pull data from s3://ghcicd, and push MLflow artifacts.


Continuous Integration (GitHub Actions)

Job            Trigger                             Python     Key steps
quick-quality  every push / PR                     3.12       linters, mypy, pre-commit, DVC pull
full-quality   weekly cron (Sat 04 UTC)            3.10/3.11  same as above (matrix)
gpu-pipeline   cron, manual, or commit with [gpu]  3.12       dvc repro -P on GPU runner, push artifacts

All jobs cache pip wheels, reuse the same AWS OIDC role, and upload artifacts for inspection.


Repository Structure

.
├── infra/                  # Terraform IaC (see above)
├── .github/workflows/      # ci.yaml – quality + GPU jobs
├── configs/                # Hydra configs
│   ├── config.yaml
│   ├── data_versions/
│   ├── pipeline/
│   └── transformations/
├── data/                   # DVC-tracked
├── dependencies/           # transformations/, modeling/
├── scripts/                # universal_step.py, orchestrate_dvc_flow.py
├── logs/
├── dvc.yaml
└── ...

Additional Docs


MLflow Outputs

Although only two random forest trials are shown in the final run, we had previously tested random forest extensively. We limited the final pipeline to two trials because random forest training is more time-consuming, but these results remain representative of the model’s performance in the broader experiments.

The R² for ridge regressions is usually between 0.86 and 0.89, while for the random forest regressor it tends to fall between 0.84 and 0.86.

Our target column is w_total_median_profit. See:

MLflow UI Table

All RMSE Values

Permutation Importances

Check the Logs

A complete pipeline log from running scripts/orchestrate_dvc_flow.py is available in the logs/pipeline directory.

Hydra logs from the same execution, with one file per transformation, are located in logs/runs.


Installation & Basic Setup

Disclaimer: The instructions below assume you’re familiar with DVC, Hydra, and basic Python package management. Because this is a portfolio repo, you may need to tweak some paths in configs/ or environment variables to get everything running on your setup.

1. Create & Activate a Python Environment

conda env create -f env.yaml
conda activate ny

or

micromamba env create -f env.yaml
micromamba activate ny

2. (Optional) Set Environment Variables

Some Hydra configs reference environment variables like $CMD_PYTHON or $PROJECT_ROOT. Define them manually or via a .env file:

export CMD_PYTHON=/path/to/your/conda/envs/ny/bin/python
export PROJECT_ROOT=/path/to/this/repo
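If you are unsure whether the variables are visible to the Python process, a quick check (not part of the repo) is:

# Quick sanity check for the expected environment variables (not part of the repo).
import os

for var in ("CMD_PYTHON", "PROJECT_ROOT"):
    value = os.environ.get(var)
    print(f"{var} = {value}" if value else f"WARNING: {var} is not set")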

3. Pull Data & Artifacts from S3 (Optional)

python dependencies/io/pull_dvc_s3.py

This configures the DVC remote (public S3) and pulls data into data/.

python dependencies/io/pull_mlruns_s3.py

This populates the mlruns/ folder with finalized experiments.
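Conceptually, the DVC helper amounts to pointing DVC at the public remote and pulling; the sketch below is an approximation, and the remote name and bucket configuration in the real dependencies/io/pull_dvc_s3.py may differ:

# Rough sketch of a DVC pull helper; the real dependencies/io/pull_dvc_s3.py may differ.
import subprocess

REMOTE_NAME = "storage"      # placeholder remote name
REMOTE_URL = "s3://ghcicd"   # bucket referenced in the Infrastructure section; treat as a placeholder

subprocess.run(["dvc", "remote", "add", "--force", "--default", REMOTE_NAME, REMOTE_URL], check=True)
subprocess.run(["dvc", "pull"], check=True)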

4. Check or Adjust dvc.yaml

If you find path mismatches, edit commands in dvc.yaml or override Hydra values on the CLI.


Running the Pipeline

1. Force Reproduce All Stages

dvc repro --force -P

2. Run a Single Stage

dvc repro --force -s v10_lag_columns

3. Check the Logs

Logs live in logs/runs/${timestamp} with one file per step.


Known Caveats

  1. Manual Pipeline Config Adjustments – export required env vars or update paths.
  2. Mixed Git/DVC Tracking – see data/.
  3. S3 Accessibility – all datasets and artifacts live in a public S3 bucket.

Why This Setup?

  1. Config-Driven & Modular
  2. Version-Controlled & Reproducible
  3. IaC for repeatable cloud resources
  4. CI guarantees style, type-safety, reproducibility
  5. MLflow experiment management

Contact

Tobias Klein — open an issue or DM on LinkedIn.

© 2025 Tobias Klein. All rights reserved. This repository is provided for demonstration and personal review. No license is granted for commercial or non-commercial use, copying, modification, or distribution without explicit, written permission.
