Config-driven analytics on New York State DRG data with Hydra, DVC, MLflow, Optuna, and Terraform-provisioned AWS infrastructure. The project showcases senior-level MLOps patterns: modular transformations, data lineage, hyper-parameter search, IaC, and CI/CD automation.
Links to every Spotlight post and deep-dive project note on each MLOps pipeline component—DVC, feature engineering, hyperparameter tuning, logging, MLflow, modular code, Jinja2 templates, and transformations.
- Spotlight The Power of a Single dvc.yaml in MLOps
- Spotlight Feature Engineering for Reproducibility and Scalability
- Spotlight Hyperparameter Tuning with Hydra, Optuna, and MLflow
- Spotlight Logging for MLOps Consistency
- Spotlight MLflow Integration
- Spotlight Modular Code as a Cornerstone of MLOps
- Spotlight Jinja2 Templates for Efficient Pipeline Generation
- Spotlight Modular Transformations
- Exploring dvc.yaml The Engine of a Reproducible Pipeline
- A Comprehensive Look at Feature Engineering in a Modular MLOps Pipeline
- A Comprehensive Look at Hyperparameter Tuning with Hydra and Optuna in an MLOps Pipeline
- A Comprehensive Look at Logging in a Modular MLOps Pipeline
- The Integration Of MLflow In This Project
- A Comprehensive Look at Modular Code in an MLOps Pipeline
- Automating DVC Pipelines with Jinja2 Templates
- Transformations as the Backbone of a Modular MLOps Pipeline
Note: This repo is primarily a portfolio project. The pipeline is mostly reproducible, but may require a few manual adjustments to run end-to-end on your machine (explained below). If you just want to inspect the pipeline structure and ML artifacts, you can do so without running the entire pipeline locally.
- Name: 2010 New York State Hospital Inpatient Discharge Data
- Info (Kaggle): 2010 New York State Hospital Inpatient Discharge Data
- Source: data.world (requires login)
- Hydra Configuration – parameters live outside the codebase.
- Data Versioning with DVC
- Experiment Tracking with MLflow
- Modular Transformations
- Infrastructure-as-Code (Terraform) – S3 buckets, IAM OIDC role, optional GPU runner are provisioned repeatably.
- CI/CD via GitHub Actions
- Metadata Logging
This project supports data validation through Pandera-based checks.
The universal_step.py script automatically runs tests whenever the relevant test flags (e.g., check_required_columns, check_row_count) are set to True in configs/transformations/.
How it Works
- Test Definitions: Located in dependencies/tests/check_required_columns.py and dependencies/tests/check_row_count.py (see the sketch after the example below).
- Test Config: Global YAML configs in configs/tests/base.yaml specify the required columns and expected row counts (these can be overridden in configs/test_params/).
- Automatic Execution: Once a transformation step completes, the script checks if any test flags are True. If so, it calls the corresponding test function with the parameters from cfg.tests.
- Failure Handling: Any mismatch or missing column triggers a Pandera SchemaError, halting the pipeline.
Example
To enable the column and row-count checks, set in the respective YAML config:
# configs/transformations/base.yaml
check_required_columns: true
check_row_count: true
# configs/test_params/base.yaml
# configs/test_params/v0...v13.yaml set version-specific values where needed.
check_required_columns:
  required_columns: ["year", "facility_id", "apr_drg_code"]
check_row_count:
  row_count: 1081672
This ensures the pipeline data is consistent and trustworthy throughout each processing stage.
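Below is a minimal, self-contained sketch of what such Pandera-based checks can look like. It is illustrative only: the actual implementations live in dependencies/tests/ and may use different signatures, and the flag dispatch in the trailing comments is a simplified stand-in for universal_step.py.

```python
# Illustrative sketch, not the repo's actual test code.
import pandas as pd
import pandera as pa


def check_required_columns(df: pd.DataFrame, required_columns: list[str]) -> None:
    """Raise a Pandera SchemaError if any required column is missing."""
    schema = pa.DataFrameSchema({col: pa.Column(required=True) for col in required_columns})
    schema.validate(df)


def check_row_count(df: pd.DataFrame, row_count: int) -> None:
    """Raise a Pandera SchemaError if the row count differs from the expected value."""
    schema = pa.DataFrameSchema(
        checks=pa.Check(lambda d: d.shape[0] == row_count, error="unexpected row count")
    )
    schema.validate(df)


df = pd.DataFrame({"year": [2010], "facility_id": [1], "apr_drg_code": [10]})
check_required_columns(df, ["year", "facility_id", "apr_drg_code"])
check_row_count(df, 1)

# Simplified stand-in for the flag-based dispatch described above
# (exact config paths in the project differ):
# if cfg.check_required_columns:
#     check_required_columns(df, cfg.tests.check_required_columns.required_columns)
# if cfg.check_row_count:
#     check_row_count(df, cfg.tests.check_row_count.row_count)
```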
Each experiment uses a dynamically generated run_id (from ${run_id_outputs}) to group logs, artifacts, and results. In both rf_optuna_trial and ridge_optuna_trial, we inject model_tags containing:
- run_id_tag — a unique timestamped ID generated by Hydra (via ${now:%Y-%m-%d_%H-%M-%S} or ${run_id_outputs}).
- data_version_tag — the current data version in use (${data_versions.data_version_input}), ensuring visibility of dataset provenance.
How it Works
- configs/transformations/rf_optuna_trial.yaml and configs/transformations/ridge_optuna_trial.yaml define model_tags to pass into the trial function.
- MLflow uses these tags in all nested runs: training, final_model, and final_model_predict_test.
- When each run starts, the tags are logged (e.g., run_id_tag becomes a “tag” in the MLflow UI).
- Artifacts, metrics, and parameters for that run are associated with the same ID, making it easy to filter or query by run_id_tag later.
Example
# configs/transformations/rf_optuna_trial.yaml
model_tags:
  run_id_tag: ${ml_experiments.mlflow_tags.run_id_tag}
  data_version_tag: ${ml_experiments.mlflow_tags.data_version_tag}
  model_tag: RandomForestRegressor # gets updated automatically for each trial
This ensures each MLflow run is fully traceable to both a particular pipeline execution (via run_id_tag) and the underlying dataset (via data_version_tag). When your experiments scale up, searching and grouping by these tags provides a clear lineage of how each result was produced.
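As an illustration of how such tags can flow into nested runs, here is a compact, self-contained sketch. It is not the project's actual trial code: the synthetic data, run names, and hard-coded tag values are placeholders for what Hydra injects via the config.

```python
# Illustrative sketch only; in the pipeline, model_tags is injected by Hydra.
import mlflow
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

model_tags = {
    "run_id_tag": "2025-01-01_12-00-00",   # placeholder for ${run_id_outputs}
    "data_version_tag": "v13",             # placeholder for ${data_versions.data_version_input}
    "model_tag": "RandomForestRegressor",
}

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)


def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
    }
    # Each trial becomes a nested MLflow run carrying the same tags,
    # so it stays traceable to the pipeline execution and data version.
    with mlflow.start_run(run_name=f"training_trial_{trial.number}", nested=True, tags=model_tags):
        model = RandomForestRegressor(**params, random_state=0).fit(X_train, y_train)
        score = r2_score(y_valid, model.predict(X_valid))
        mlflow.log_params(params)
        mlflow.log_metric("r2", score)
    return score


with mlflow.start_run(run_name="rf_optuna_trial", tags=model_tags):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=2)
```

Filtering in the MLflow UI on tags.run_id_tag then returns the parent run and every nested trial run produced by one pipeline execution.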
infra/
├── main.tf # root module – backend & providers
├── modules/
│ ├── s3_data/ # DVC remote bucket
│ ├── s3_mlflow/ # MLflow artifacts bucket
│ ├── iam_github_oidc/ # federated role for CI
│ └── gpu_runner/ # optional self-hosted GPU runner
└── environments/
├── dev.tfvars
└── prod.tfvars
Bootstrap
cd infra
terraform init
terraform apply -var-file=environments/dev.tfvars
All resources are tagged mlops-ny-drg and can be removed with terraform destroy.
CI jobs assume GithubOIDCRole, pull data from s3://ghcicd, and push MLflow artifacts.
| Job | Trigger | Python | Key steps |
|---|---|---|---|
| quick-quality | every push / PR | 3.12 | linters, mypy, pre-commit, DVC pull |
| full-quality | weekly cron (Sat 04 UTC) | 3.10/3.11 | same as above (matrix) |
| gpu-pipeline | cron, manual, or commit with [gpu] | 3.12 | dvc repro -P on GPU runner, push artifacts |
All jobs cache pip wheels, reuse the same AWS OIDC role, and upload artifacts for inspection.
.
├── infra/ # Terraform IaC (see above)
├── .github/workflows/ # ci.yaml – quality + GPU jobs
├── configs/ # Hydra configs
│ ├── config.yaml
│ ├── data_versions/
│ ├── pipeline/
│ └── transformations/
├── data/ # DVC-tracked
├── dependencies/ # transformations/, modeling/
├── scripts/ # universal_step.py, orchestrate_dvc_flow.py
├── logs/
├── dvc.yaml
└── ...
- See documentation/detailed_documentation.md for a deeper dive.
- Hydra: docs
- DVC: docs
- MLflow: docs
Although only two random forest trials are shown in the final run, we had previously tested random forest extensively. We limited it to two in the final pipeline because it is more time-consuming, but these results remain representative of the model’s performance from the broader experiments.
The R² for ridge regressions is usually between 0.86 and 0.89, while for the random forest regressor it tends to fall between 0.84 and 0.86.
Our target column is w_total_median_profit. See:
- Transformation: agg_severities.py
- Associated Config: agg_severities.yaml
A complete pipeline log from running scripts/orchestrate_dvc_flow.py is available in the logs/pipeline directory: pipeline log directory.
Hydra logs from the same execution, with one file per transformation, are located in logs/runs: runs logs directory.
Disclaimer: The instructions below assume you’re familiar with DVC, Hydra, and basic Python package management. Because this is a portfolio repo, you may need to tweak some paths in configs/ or environment variables to get everything running on your setup.
conda env create -f env.yaml
conda activate ny
or
micromamba env create -f env.yaml
micromamba activate ny
Some Hydra configs reference environment variables like $CMD_PYTHON or $PROJECT_ROOT. Define them manually or via a .env file:
export CMD_PYTHON=/path/to/your/conda/envs/ny/bin/python
export PROJECT_ROOT=/path/to/this/repo
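If you want to verify that such variables resolve inside a config, the snippet below is a small sketch using OmegaConf's oc.env resolver. This is only an illustration; the project's configs and dvc.yaml commands may reference $CMD_PYTHON / $PROJECT_ROOT differently (e.g., via plain shell expansion).

```python
# Minimal sketch: check that an environment variable resolves inside an OmegaConf config.
import os
from omegaconf import OmegaConf

os.environ.setdefault("PROJECT_ROOT", "/path/to/this/repo")

cfg = OmegaConf.create({"paths": {"root": "${oc.env:PROJECT_ROOT}"}})
print(OmegaConf.to_container(cfg, resolve=True))
# {'paths': {'root': '/path/to/this/repo'}}
```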
python dependencies/io/pull_dvc_s3.py
This configures the DVC remote (public S3) and pulls data into data/.
python dependencies/io/pull_mlruns_s3.py
This populates the mlruns/ folder with finalized experiments.
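For context, both pull scripts read from a public S3 bucket. A generic sketch of anonymous (unsigned) S3 access with boto3 is shown below; the bucket name and prefix are placeholders, and the real logic lives in dependencies/io/.

```python
# Generic sketch of downloading objects from a public S3 bucket without credentials.
from pathlib import Path

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "example-public-bucket"  # placeholder, not the project's bucket
PREFIX = "mlruns/"                # placeholder prefix

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        target = Path(obj["Key"])
        target.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], str(target))
```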
If you find path mismatches, edit commands in dvc.yaml or override Hydra values on the CLI.
dvc repro --force -P
dvc repro --force -s v10_lag_columns
Logs live in logs/runs/${timestamp} with one file per step.
- Manual Pipeline Config Adjustments – export required env vars or update paths.
- Mixed Git/DVC Tracking – see data/.
- S3 Accessibility – all datasets and artifacts live in a public S3 bucket.
- Config-Driven & Modular
- Version-Controlled & Reproducible
- IaC for repeatable cloud resources
- CI guarantees style, type-safety, reproducibility
- MLflow experiment management
Tobias Klein — open an issue or DM on LinkedIn.
© 2025 Tobias Klein. All rights reserved. This repository is provided for demonstration and personal review. No license is granted for commercial or non-commercial use, copying, modification, or distribution without explicit, written permission.