This repository provides a reusable and modular foundation for MLOps and Big Data projects.
It is designed to accelerate dataset acquisition and experimentation, and to provide a starting point for MLOps pipeline development and Big Data infrastructure setups.
It includes:
- A containerised (Docker-based) Ubuntu 24.04 LTS "devbox", equipped with essential tools (e.g., Python, VisiData, …) and easily extensible via pip3 install and requirements.txt, with a clean Dockerfile (see the sketch after this list)
- Ready to download and split the necessary dataset(s): includes scripts to retrieve datasets from Kaggle and store them in a persistent volume
- Out-of-the-box access to third-party credential configuration files (e.g., kaggle.json in ./config/kaggle)
- Utilities for chunking or splitting time-series data (e.g., weekly splits, as used for Favorita)
- Clearly defined folders for persistent data, configuration, and scripts
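For example, to make a new Python package available in the devbox, append it to the devbox's requirements.txt and rebuild the image (a minimal sketch, assuming you start from the repository root; the package name is just an illustration):

$ echo "polars" >> infrastructure/devbox/requirements.txt
$ cd infrastructure/devbox
$ ./001_build_image.bash   # rebuilds the image; the pip install layer picks up the new package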
Follow these steps to get started, download example data (e.g., the Favorita dataset), and begin your project workflow.
Run the following commands in a Unix-like environment (Linux, macOS, or WSL). Note: Native Windows terminals are not recommended due to compatibility issues.
$ git clone git@github.com:PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode.git
# Or: gh repo clone PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode
# Or: git clone https://github.com/PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode.git
# Or: go to https://github.com/PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode and download the ZIP and extract it.
$ cd MLOps-RetakeProject-2425-StartingCode
Read the structure overview depicted below to understand the purpose of:
- commands/
- config/ (and its subdirectory config/kaggle/)
- scripts/
- infrastructure/devbox/
- data/ (and its subdirectory raw_data/)
🗂️ Repository Structure
.
├── commands/          # CLI scripts and wrappers
├── config/            # Secrets/configs (excluded from version control)
│   └── kaggle/        # Place to put your local kaggle.json
├── data/              # Local data
│   └── raw_data/      # Raw data output (after download/extraction/splitting)
├── documentation/     # A place to put documentation
├── infrastructure/    # Put all infrastructure code here.
│   └── devbox/        # Self-contained Docker-based devbox environment
├── scripts/           # Python utilities for data handling
└── README.org         # This readme.
This sets up a containerised Python environment with all dependencies:
$ cd infrastructure/devbox
$ ./001_build_image.bash
🔧 Building Docker image: pxl_mlops_devbox
👤 User: user (UID: 1000, GID: 1000)
[+] Building 37.8s (25/25) FINISHED docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 3.54kB 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:24.04 1.9s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [ 1/20] FROM docker.io/library/ubuntu:24.04@sha256:440dcf6a5640b2ae5c77724e68787a906afb8ddee98bf86db94eea8528c2c076 1.4s
=> => resolve docker.io/library/ubuntu:24.04@sha256:440dcf6a5640b2ae5c77724e68787a906afb8ddee98bf86db94eea8528c2c076 0.0s
=> => sha256:3eff7d219313fd6db206bd90410da1ca5af1ba3e5b71b552381cea789c4c6713 28.86MB / 28.86MB 1.0s
=> => extracting sha256:3eff7d219313fd6db206bd90410da1ca5af1ba3e5b71b552381cea789c4c6713 0.4s
=> [internal] load build context 0.0s
=> => transferring context: 2.32kB 0.0s
=> [ 2/20] RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections 0.2s
=> [ 3/20] RUN apt-get update && apt-get --with-new-pkgs upgrade -y && apt-get install -y sudo 5.9s
=> [ 4/20] RUN rm -f /etc/update-motd.d/10-help-text 0.1s
=> [ 5/20] COPY ./create_user.bash /usr/local/bin/ 0.0s
=> [ 6/20] RUN chmod +x /usr/local/bin/create_user.bash 0.1s
=> [ 7/20] RUN /usr/local/bin/create_user.bash 0.1s
=> [ 8/20] RUN unset USER_PASSWORD 0.1s
=> [ 9/20] RUN set -a && . /env_vars && set +a && apt-get install -y --no-install-recommends python3 python3-venv && python3 -m venv /opt/venv && /opt/venv/bin/pip install --upgrade pip setuptools && chown -R "user:$GROUPNAME" /opt/venv && echo 'export PATH="/opt/venv/bin:$PATH"' >> /home/ 6.8s
=> [10/20] COPY requirements.txt /opt/requirements.txt 0.0s
=> [11/20] RUN set -a && . /env_vars && set +a && /opt/venv/bin/pip install --no-cache-dir -r /opt/requirements.txt && chown -R user:$GROUPNAME /opt/venv 8.9s
=> [12/20] RUN apt-get update && apt-get install -y --no-install-recommends git tmux htop vim gosu && rm -rf /var/lib/apt/lists/* 4.9s
=> [13/20] RUN set -a && . /env_vars && set +a && mkdir -p /home/user/.tmux/plugins/tpm && git clone https://github.com/tmux-plugins/tpm /home/user/.tmux/plugins/tpm && git clone https://github.com/jimeh/tmux-themepack.git /home/user/.tmux-themepack 1.4s
=> [14/20] RUN set -a && . /env_vars && set +a && echo "PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\W\[\033[00m\]\$ '" >> /home/user/.bashrc 0.1s
=> [15/20] COPY .tmux.conf /tmp/.tmux.conf 0.0s
=> [16/20] RUN set -a && . /env_vars && set +a && cp /tmp/.tmux.conf /home/user/.tmux.conf && chown user:$GROUPNAME /home/user/.tmux.conf 0.1s
=> [17/20] RUN set -a && . /env_vars && set +a && mkdir /data /commands /scripts /home/user/bin && echo 'source "$HOME/.bashrc"' >> /home/user/.bash_profile && echo "alias ll='ls --color=auto -alFh'" >> /home/user/.bashrc && echo "LS_COLORS=$LS_COLORS:'di=1;33:ln=36'" >> /home/user/.bashrc && 0.1s
=> [18/20] RUN set -a && . /env_vars && set +a 0.1s
=> [19/20] COPY ./entrypoint.bash /usr/local/bin/dev_entry 0.1s
=> [20/20] RUN chmod +x /usr/local/bin/dev_entry 0.1s
=> exporting to image 5.1s
=> => exporting layers 3.4s
=> => exporting manifest sha256:babdd44a8d7509bb6dd3b388582ddefe076f1332497714b9dddba6acd78fccee 0.0s
=> => exporting config sha256:ea552bf1845d8fe0a6684983ace3e164b7ca1a029a59aacb1f32b879aefd6d44 0.0s
=> => exporting attestation manifest sha256:b8768fcbf700636f138e1db7d82bcebf4544091d6480ce30c817a376e9d273eb 0.0s
=> => exporting manifest list sha256:376331f9312194645fa6b32ea7d79a6ab0bd2dfcff987ab950a45385c174ebb5 0.0s
=> => naming to docker.io/library/pxl_mlops_devbox:latest 0.0s
=> => unpacking to docker.io/library/pxl_mlops_devbox:latest 1.7s
2 warnings found (use docker --debug to expand):
- SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "USER_PASSWORD") (line 34)
- SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ENV "USER_PASSWORD") (line 37)
✅ Image 'pxl_mlops_devbox' built successfully.
Note: It's possible to override the user information (name, UID, GID, password).
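For example, something along these lines (a hypothetical invocation; the exact variable or argument names are defined in 001_build_image.bash, so check the script first):

$ USERNAME=alice USER_UID=1001 USER_GID=1001 ./001_build_image.bash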
$ ./002_start_devbox.bash
This mounts your folders into the container:
| DIR in repository | DIR in container | Extra information |
|---|---|---|
| commands/ | /commands | Automatically added to PATH |
| config/ | /config | |
| data/ | /data | |
| scripts/ | /scripts | Automatically added to PATH |

This .bash script can easily be extended with extra volumes if needed.
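For instance, an extra bind mount could be added next to the existing ones in the docker run call (an illustrative fragment; the surrounding options are abbreviated and the models/ directory is just an example):

docker run ... \
    --volume "$(pwd)/../../models:/models" \
    ...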
- Go to: https://www.kaggle.com/account
- Click "Create New API Token" under the API section
- This downloads a file: kaggle.json

Put the kaggle.json file in the config/kaggle directory on your host machine; it will automatically be available in the container.
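For example (assuming the token landed in your downloads folder), move it into place and restrict its permissions, since the Kaggle client warns about world-readable tokens:

$ mkdir -p config/kaggle
$ mv ~/Downloads/kaggle.json config/kaggle/
$ chmod 600 config/kaggle/kaggle.json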
In this example we are going to use the Favorita Grocery Sales Forecasting dataset, so we first need to accept its terms.
Visit the dataset page, click "Join Competition", and follow the necessary steps: https://www.kaggle.com/competitions/favorita-grocery-sales-forecasting
Inside the devbox:
$ run_kaggle_download_script /scripts/download_favorita.py
This will download the dataset (if kaggle.json is configured and the terms are accepted) and extract it into /data.
The data will be located in:
data/raw_data/favorita-grocery-sales-forecasting/
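The download step above is roughly equivalent to calling the official Kaggle CLI by hand (a sketch, assuming the kaggle client is installed in the devbox; the helper script additionally handles extraction):

$ kaggle competitions download -c favorita-grocery-sales-forecasting -p /data/raw_data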
You can explore the data using:
- Your own Python scripts (place them in /scripts)
- The excellent terminal-based tool VisiData, an open-source data multitool
For example:
$ vd /data/raw_data/favorita-grocery-sales-forecasting/train.csv
Inspect all files.
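If you want a quick first glance with plain coreutils before reaching for VisiData:

$ head -n 3 /data/raw_data/favorita-grocery-sales-forecasting/train.csv
$ wc -l /data/raw_data/favorita-grocery-sales-forecasting/*.csv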
Pro tip: Keep an exploration log in Markdown to stay organized and avoid information overload.
Consult the retake project assignment brief of the MLOps and/or Big Data course.
$ /scripts/split_favorita_train_in_weeks.py
❌ No valid option provided. Use one of:
--overview Show dataset summary
--all Split full dataset by week
--from DATE --to DATE Split only specific date range
--year YYYY --weeks N Split N weeks from ISO Week 1
--year YYYY --start-week W --weeks N Start from ISO Week W
The train.csv file is quite large, so splitting it into smaller weekly files may improve performance and enable meaningful MLOps or Big Data operations.
$ /scripts/split_favorita_train_in_weeks.py --overview
Scanning dataset for date overview...
📊 Dataset Overview:
- Oldest date : 2013-01-01
- Newest date : 2017-08-15
- Total days : 1688
- Total weeks : 241
- Total years : 4.62
This tool allows you to split the train.csv file into weekly chunks.
Examples:
- Split the entire dataset (this will take a long time):
$ /scripts/split_favorita_train_in_weeks.py --all ...
The output is too verbose to include in this guide.
- Split a specific year and number of weeks:
$ /scripts/split_favorita_train_in_weeks.py --year 2016 --start-week 10 --weeks 5
🗓️ Splitting 5 week(s) starting from Week 10, 2016
From 2016-03-07 to 2016-04-10
📦 Splitting data from 2016-03-07 to 2016-04-10
/scripts/split_favorita_train_in_weeks.py:49: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(INPUT_FILE, parse_dates=["date"], chunksize=CHUNK_SIZE):
📁 Writing weekly files to: /data/raw_data/favorita-grocery-sales-forecasting/weeks
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W10.csv → 662413 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W11.csv → 665398 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W12.csv → 657875 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W13.csv → 681864 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W14.csv → 674518 rows
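A quick sanity check of the generated slices:

$ ls -lh /data/raw_data/favorita-grocery-sales-forecasting/weeks/
$ wc -l /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W1*.csv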
Use the weekly datasets, train models, explore drift, build pipelines: whatever your assignment requires.
As your project evolves, keep refining your work:
- Revisit step 10 regularly to stay aligned with the project requirements.
- Repeat step 12 with new split configurations.
- Revisit steps 9–11 to explore new slices of data or new experiments.
- Continue step 13 until your project(s) are completed.
Use this directory to implement the requested architecture using Docker Compose and all related and necessary tools.
Use the devbox as inspiration, and leverage Docker volumes for persistent storage and shared data access between containers where needed.
You can also add subdirectories in commands/, config/, scripts/, etc., and mount those as volumes to segregate scripts for specific containers (see the sketch below).
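A minimal sketch of that idea with plain docker commands (volume, image, and path names are placeholders; for the assignment you would express the same wiring in a docker-compose.yml):

$ docker volume create shared_data
$ docker run -d --name ingest --volume shared_data:/data --volume "$(pwd)/scripts/ingest:/scripts" my_ingest_image
$ docker run -d --name trainer --volume shared_data:/data --volume "$(pwd)/scripts/train:/scripts" my_training_image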
Add additional scripts to this directory. It's recommended to organize them into subdirectories.
You may also create top-level folders like src/ if your project requires it.
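For example (the directory names are only suggestions):

$ mkdir -p scripts/eda scripts/preprocessing src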
Put all documentation in this directory.
Feel free to fork, modify, or reuse this layout. Contributions or suggestions are welcome.