Config-driven analytics on New York State DRG data with Hydra, DVC, MLflow, Optuna, and Terraform-provisioned AWS infrastructure. The project showcases senior-level MLOps patterns: modular transformations, data lineage, hyper-parameter search, IaC, and CI/CD automation.
Links to every Spotlight post and deep-dive project note on each MLOps pipeline component—DVC, feature engineering, hyperparameter tuning, logging, MLflow, modular code, Jinja2 templates, and transformations.
- Spotlight The Power of a Single dvc.yaml in MLOps
- Spotlight Feature Engineering for Reproducibility and Scalability
- Spotlight Hyperparameter Tuning with Hydra, Optuna, and MLflow
- Spotlight Logging for MLOps Consistency
- Spotlight MLflow Integration
- Spotlight Modular Code as a Cornerstone of MLOps
- Spotlight Jinja2 Templates for Efficient Pipeline Generation
- Spotlight Modular Transformations
- Exploring dvc.yaml The Engine of a Reproducible Pipeline
- A Comprehensive Look at Feature Engineering in a Modular MLOps Pipeline
- A Comprehensive Look at Hyperparameter Tuning with Hydra and Optuna in an MLOps Pipeline
- A Comprehensive Look at Logging in a Modular MLOps Pipeline
- The Integration Of MLflow In This Project
- A Comprehensive Look at Modular Code in an MLOps Pipeline
- Automating DVC Pipelines with Jinja2 Templates
- Transformations as the Backbone of a Modular MLOps Pipeline
Note: This repo is primarily a portfolio project. The pipeline is mostly reproducible, but may require a few manual adjustments to run end-to-end on your machine (explained below). If you just want to inspect the pipeline structure and ML artifacts, you can do so without running the entire pipeline locally.
- Name: 2010 New York State Hospital Inpatient Discharge Data
- Info (Kaggle): 2010 New York State Hospital Inpatient Discharge Data
- Source: data.world (requires login)
- Hydra Configuration – parameters live outside the codebase.
- Data Versioning with DVC
- Experiment Tracking with MLflow
- Modular Transformations
- Infrastructure-as-Code (Terraform) – S3 buckets, IAM OIDC role, optional GPU runner are provisioned repeatably.
- CI/CD via GitHub Actions
- Metadata Logging
This project supports data validation through Pandera-based checks.
The universal_step.py script automatically runs tests whenever the relevant test flags (e.g., check_required_columns, check_row_count) are set to True in configs/transformations/.
How it Works
- Test Definitions: Located in dependencies/tests/check_required_columns.py and dependencies/tests/check_row_count.py (see the sketch after the example below).
- Test Config: Global YAML configs in configs/tests/base.yaml specify the required columns and expected row counts (these can be overridden in configs/test_params/).
- Automatic Execution: Once a transformation step completes, the script checks if any test flags are True. If so, it calls the corresponding test function with the parameters from cfg.tests.
- Failure Handling: Any mismatch or missing column triggers a Pandera SchemaError, halting the pipeline.
Example
To enable the column and row-count checks, set in the respective YAML config:
# configs/transformations/base.yaml
check_required_columns: true
check_row_count: true
# configs/test_params/base.yaml
# configs/test_params/v0...v13.yaml set version-specific values where needed.
check_required_columns:
  required_columns: ["year", "facility_id", "apr_drg_code"]
check_row_count:
  row_count: 1081672
This ensures the pipeline data is consistent and trustworthy throughout each processing stage.
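Below is a minimal, self-contained sketch of what such Pandera-based checks can look like. It is illustrative only: the actual implementations live in dependencies/tests/ and may use different signatures, and the flag dispatch in the trailing comments is a simplified stand-in for universal_step.py.

```python
# Illustrative sketch, not the repo's actual test code.
import pandas as pd
import pandera as pa


def check_required_columns(df: pd.DataFrame, required_columns: list[str]) -> None:
    """Raise a Pandera SchemaError if any required column is missing."""
    schema = pa.DataFrameSchema({col: pa.Column(required=True) for col in required_columns})
    schema.validate(df)


def check_row_count(df: pd.DataFrame, row_count: int) -> None:
    """Raise a Pandera SchemaError if the row count differs from the expected value."""
    schema = pa.DataFrameSchema(
        checks=pa.Check(lambda d: d.shape[0] == row_count, error="unexpected row count")
    )
    schema.validate(df)


df = pd.DataFrame({"year": [2010], "facility_id": [1], "apr_drg_code": [10]})
check_required_columns(df, ["year", "facility_id", "apr_drg_code"])
check_row_count(df, 1)

# Simplified stand-in for the flag-based dispatch described above
# (exact config paths in the project differ):
# if cfg.check_required_columns:
#     check_required_columns(df, cfg.tests.check_required_columns.required_columns)
# if cfg.check_row_count:
#     check_row_count(df, cfg.tests.check_row_count.row_count)
```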
Each experiment uses a dynamically generated run_id (from ${run_id_outputs}) to group logs, artifacts, and results. In both rf_optuna_trial and ridge_optuna_trial, we inject model_tags containing:
- run_id_tag — a unique timestamped ID generated by Hydra (via ${now:%Y-%m-%d_%H-%M-%S} or ${run_id_outputs}).
- data_version_tag — the current data version in use (${data_versions.data_version_input}), ensuring visibility of dataset provenance.
How it Works
- configs/transformations/rf_optuna_trial.yaml and configs/transformations/ridge_optuna_trial.yaml define model_tags to pass into the trial function.
- MLflow uses these tags in all nested runs: training, final_model, and final_model_predict_test.
- When each run starts, the tags are logged (e.g., run_id_tag becomes a “tag” in the MLflow UI).
- Artifacts, metrics, and parameters for that run are associated with the same ID, making it easy to filter or query by run_id_tag later.
Example
# configs/transformations/rf_optuna_trial.yaml
model_tags:
  run_id_tag: ${ml_experiments.mlflow_tags.run_id_tag}
  data_version_tag: ${ml_experiments.mlflow_tags.data_version_tag}
  model_tag: RandomForestRegressor # gets updated automatically for each trial
This ensures each MLflow run is fully traceable to both a particular pipeline execution (via run_id_tag) and the underlying dataset (via data_version_tag). When your experiments scale up, searching and grouping by these tags provides a clear lineage of how each result was produced.
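As an illustration of how such tags can flow into nested runs, here is a compact, self-contained sketch. It is not the project's actual trial code: the synthetic data, run names, and hard-coded tag values are placeholders for what Hydra injects via the config.

```python
# Illustrative sketch only; in the pipeline, model_tags is injected by Hydra.
import mlflow
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

model_tags = {
    "run_id_tag": "2025-01-01_12-00-00",   # placeholder for ${run_id_outputs}
    "data_version_tag": "v13",             # placeholder for ${data_versions.data_version_input}
    "model_tag": "RandomForestRegressor",
}

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)


def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
    }
    # Each trial becomes a nested MLflow run carrying the same tags,
    # so it stays traceable to the pipeline execution and data version.
    with mlflow.start_run(run_name=f"training_trial_{trial.number}", nested=True, tags=model_tags):
        model = RandomForestRegressor(**params, random_state=0).fit(X_train, y_train)
        score = r2_score(y_valid, model.predict(X_valid))
        mlflow.log_params(params)
        mlflow.log_metric("r2", score)
    return score


with mlflow.start_run(run_name="rf_optuna_trial", tags=model_tags):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=2)
```

Filtering in the MLflow UI on tags.run_id_tag then returns the parent run and every nested trial run produced by one pipeline execution.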
infra/
├── main.tf # root module – backend & providers
├── modules/
│ ├── s3_data/ # DVC remote bucket
│ ├── s3_mlflow/ # MLflow artifacts bucket
│ ├── iam_github_oidc/ # federated role for CI
│ └── gpu_runner/ # optional self-hosted GPU runner
└── environments/
├── dev.tfvars
└── prod.tfvars
Bootstrap
cd infra
terraform init
terraform apply -var-file=environments/dev.tfvars
All resources are tagged mlops-ny-drg and can be removed with terraform destroy.
CI jobs assume GithubOIDCRole, pull data from s3://ghcicd, and push MLflow artifacts.
| Job | Trigger | Python | Key steps |
|---|---|---|---|
| quick-quality | every push / PR | 3.12 | linters, mypy, pre-commit, DVC pull |
| full-quality | weekly cron (Sat 04 UTC) | 3.10/3.11 | same as above (matrix) |
| gpu-pipeline | cron, manual, or commit with [gpu] | 3.12 | dvc repro -P on GPU runner, push artifacts |
All jobs cache pip wheels, reuse the same AWS OIDC role, and upload artifacts for inspection.
.
├── infra/ # Terraform IaC (see above)
├── .github/workflows/ # ci.yaml – quality + GPU jobs
├── configs/ # Hydra configs
│ ├── config.yaml
│ ├── data_versions/
│ ├── pipeline/
│ └── transformations/
├── data/ # DVC-tracked
├── dependencies/ # transformations/, modeling/
├── scripts/ # universal_step.py, orchestrate_dvc_flow.py
├── logs/
├── dvc.yaml
└── ...
- See documentation/detailed_documentation.md for a deeper dive.
- Hydra: docs
- DVC: docs
- MLflow: docs
Although only two random forest trials are shown in the final run, we had previously tested random forest extensively. We limited it to two in the final pipeline because it is more time-consuming, but these results remain representative of the model’s performance from the broader experiments.
The R² for ridge regressions is usually between 0.86 and 0.89, while for the random forest regressor it tends to fall between 0.84 and 0.86.
Our target column is w_total_median_profit. See:
- Transformation: agg_severities.py
- Associated Config: agg_severities.yaml
A complete pipeline log from running scripts/orchestrate_dvc_flow.py is available in the logs/pipeline directory: pipeline log directory.
Hydra logs from the same execution, with one file per transformation, are located in logs/runs: runs logs directory.
Disclaimer: The instructions below assume you’re familiar with DVC, Hydra, and basic Python package management. Because this is a portfolio repo, you may need to tweak some paths in configs/ or environment variables to get everything running on your setup.
conda env create -f env.yaml
conda activate ny
or
micromamba env create -f env.yaml
micromamba activate ny
Some Hydra configs reference environment variables like $CMD_PYTHON or $PROJECT_ROOT. Define them manually or via a .env file:
export CMD_PYTHON=/path/to/your/conda/envs/ny/bin/python
export PROJECT_ROOT=/path/to/this/repo
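If you want to verify that such variables resolve inside a config, the snippet below is a small sketch using OmegaConf's oc.env resolver. This is only an illustration; the project's configs and dvc.yaml commands may reference $CMD_PYTHON / $PROJECT_ROOT differently (e.g., via plain shell expansion).

```python
# Minimal sketch: check that an environment variable resolves inside an OmegaConf config.
import os
from omegaconf import OmegaConf

os.environ.setdefault("PROJECT_ROOT", "/path/to/this/repo")

cfg = OmegaConf.create({"paths": {"root": "${oc.env:PROJECT_ROOT}"}})
print(OmegaConf.to_container(cfg, resolve=True))
# {'paths': {'root': '/path/to/this/repo'}}
```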
python dependencies/io/pull_dvc_s3.py
This configures the DVC remote (public S3) and pulls data into data/.
python dependencies/io/pull_mlruns_s3.py
This populates the mlruns/ folder with finalized experiments.
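For context, both pull scripts read from a public S3 bucket. A generic sketch of anonymous (unsigned) S3 access with boto3 is shown below; the bucket name and prefix are placeholders, and the real logic lives in dependencies/io/.

```python
# Generic sketch of downloading objects from a public S3 bucket without credentials.
from pathlib import Path

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "example-public-bucket"  # placeholder, not the project's bucket
PREFIX = "mlruns/"                # placeholder prefix

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        target = Path(obj["Key"])
        target.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], str(target))
```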
If you find path mismatches, edit commands in dvc.yaml or override Hydra values on the CLI.
dvc repro --force -P
dvc repro --force -s v10_lag_columns
Logs live in logs/runs/${timestamp} with one file per step.
- Manual Pipeline Config Adjustments – export required env vars or update paths.
- Mixed Git/DVC Tracking – see data/.
- S3 Accessibility – all datasets and artifacts live in a public S3 bucket.
- Config-Driven & Modular
- Version-Controlled & Reproducible
- IaC for repeatable cloud resources
- CI guarantees style, type-safety, reproducibility
- MLflow experiment management
Tobias Klein — open an issue or DM on LinkedIn.
© 2025 Tobias Klein. All rights reserved. This repository is provided for demonstration and personal review. No license is granted for commercial or non-commercial use, copying, modification, or distribution without explicit, written permission.