MLOps/Big Data Dataset Preparation & Infrastructure Template

Overview

This repository provides a reusable and modular foundation for MLOps and Big Data projects.

It is designed to accelerate dataset acquisition and experimentation, and to provide a starting point for MLOps pipeline development and Big Data infrastructure setups.

It includes:

  • A containerised (Docker-based) Ubuntu 24.04 LTS "devbox"
    • Equipped with essential tools (e.g., Python, VisiData, …) and easily extensible via pip3 install, requirements.txt, and a clean Dockerfile (a rebuild example follows this list).
    • Ready to download and split the necessary dataset(s).
      • Includes scripts to retrieve datasets from Kaggle and store them in a persistent volume
      • Out-of-the-box access to third-party credential files (e.g., kaggle.json in ./config/kaggle)
      • Provides utilities for chunking or splitting time-series data (e.g., weekly splits for Favorita)
  • Clearly defined folders for persistent data, configuration, and scripts
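
For example, adding an extra Python package to the devbox is usually just a matter of appending it to the requirements.txt used by the image build and rebuilding. The sketch below assumes requirements.txt lives in infrastructure/devbox/ next to the Dockerfile; the package name is only a placeholder.

$ echo "polars" >> infrastructure/devbox/requirements.txt   # hypothetical extra package
$ cd infrastructure/devbox && ./001_build_image.bash        # rebuild the image to pick it up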

🚀 Setup & Workflow Guide

Follow these steps to get started, download example data (e.g., the Favorita dataset), and begin your project workflow.

1. Clone this repository

Run the following commands in a Unix-like environment (Linux, macOS, or WSL). Note: Native Windows terminals are not recommended due to compatibility issues.

$ git clone git@github.com:PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode.git
# Or: gh repo clone PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode
# Or: git clone https://github.com/PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode.git
# Or: go to https://github.com/PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode and download the ZIP and extract it.
$ cd MLOps-RetakeProject-2425-StartingCode

2. Get familiar with the structure

Read the structure overview depicted below to understand the purpose of:

  • commands/
  • config/
  • scripts/
  • infrastructure/devbox/
  • data/ and its subdirectory raw_data/
  • config/kaggle/

🗂️ Repository Structure

.
├── commands/                         # CLI scripts and wrappers
├── config/                           # Secrets/configs (excluded from version control)
│   └── kaggle/                       #    Place to put your local kaggle.json
├── data/                             # Local data
│   └── raw_data/                     #    Raw data output (after download/extraction/splitting)
├── documentation/                    # A place to put documentation
├── infrastructure/                   # Put all infrastructure code here.
│   └── devbox/                       #    Self-contained Docker-based devbox environment
├── scripts/                          # Python utilities for data handling
└── README.org                        # This readme.

3. Build the devbox

This sets up a containerised Python environment with all dependencies:

$ cd infrastructure/devbox
$ ./001_build_image.bash
🔧 Building Docker image: pxl_mlops_devbox
👀 User: user (UID: 1000, GID: 1000)
[+] Building 37.8s (25/25) FINISHED                              docker:desktop-linux
 => [internal] load build definition from Dockerfile
 => => transferring dockerfile: 3.54kB
 => [internal] load metadata for docker.io/library/ubuntu:24.04
 => [internal] load .dockerignore
 => => transferring context: 2B
 => [ 1/20] FROM docker.io/library/ubuntu:24.04@sha256:440dcf6a5640b2ae5c77724e68787a906afb8ddee98bf86db94eea8528c2c076
 => => resolve docker.io/library/ubuntu:24.04@sha256:440dcf6a5640b2ae5c77724e68787a906afb8ddee98bf86db94eea8528c2c076
 => => sha256:3eff7d219313fd6db206bd90410da1ca5af1ba3e5b71b552381cea789c4c6713 28.86MB / 28.86MB
 => => extracting sha256:3eff7d219313fd6db206bd90410da1ca5af1ba3e5b71b552381cea789c4c6713
 => [internal] load build context
 => => transferring context: 2.32kB
 => [ 2/20] RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
 => [ 3/20] RUN apt-get update && apt-get --with-new-pkgs upgrade -y && apt-get install -y sudo
 => [ 4/20] RUN rm -f /etc/update-motd.d/10-help-text
 => [ 5/20] COPY ./create_user.bash /usr/local/bin/
 => [ 6/20] RUN chmod +x /usr/local/bin/create_user.bash
 => [ 7/20] RUN /usr/local/bin/create_user.bash
 => [ 8/20] RUN unset USER_PASSWORD
 => [ 9/20] RUN set -a && . /env_vars && set +a && apt-get install -y --no-install-recommends python3 python3-venv && python3 -m venv /opt/venv && /opt/venv/bin/pip install --upgrade pip setuptools && chown -R "user:$GROUPNAME" /opt/venv && echo 'export PATH="/opt/venv/bin:$PATH"' >> /home/
 => [10/20] COPY requirements.txt /opt/requirements.txt
 => [11/20] RUN set -a && . /env_vars && set +a && /opt/venv/bin/pip install --no-cache-dir -r /opt/requirements.txt && chown -R user:$GROUPNAME /opt/venv
 => [12/20] RUN apt-get update && apt-get install -y --no-install-recommends git tmux htop vim gosu && rm -rf /var/lib/apt/lists/*
 => [13/20] RUN set -a && . /env_vars && set +a && mkdir -p /home/user/.tmux/plugins/tpm && git clone https://github.com/tmux-plugins/tpm /home/user/.tmux/plugins/tpm && git clone https://github.com/jimeh/tmux-themepack.git /home/user/.tmux-themepack
 => [14/20] RUN set -a && . /env_vars && set +a && echo "PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\W\[\033[00m\]\$ '" >> /home/user/.bashrc
 => [15/20] COPY .tmux.conf /tmp/.tmux.conf
 => [16/20] RUN set -a && . /env_vars && set +a && cp /tmp/.tmux.conf /home/user/.tmux.conf && chown user:$GROUPNAME /home/user/.tmux.conf
 => [17/20] RUN set -a && . /env_vars && set +a && mkdir /data /commands /scripts /home/user/bin && echo 'source "$HOME/.bashrc"' >> /home/user/.bash_profile && echo "alias ll='ls --color=auto -alFh'" >> /home/user/.bashrc && echo "LS_COLORS=$LS_COLORS:'di=1;33:ln=36'" >> /home/user/.bashrc &&
 => [18/20] RUN set -a && . /env_vars && set +a
 => [19/20] COPY ./entrypoint.bash /usr/local/bin/dev_entry
 => [20/20] RUN chmod +x /usr/local/bin/dev_entry
 => exporting to image
 => => exporting layers
 => => exporting manifest sha256:babdd44a8d7509bb6dd3b388582ddefe076f1332497714b9dddba6acd78fccee
 => => exporting config sha256:ea552bf1845d8fe0a6684983ace3e164b7ca1a029a59aacb1f32b879aefd6d44
 => => exporting attestation manifest sha256:b8768fcbf700636f138e1db7d82bcebf4544091d6480ce30c817a376e9d273eb
 => => exporting manifest list sha256:376331f9312194645fa6b32ea7d79a6ab0bd2dfcff987ab950a45385c174ebb5
 => => naming to docker.io/library/pxl_mlops_devbox:latest
 => => unpacking to docker.io/library/pxl_mlops_devbox:latest

 2 warnings found (use docker --debug to expand):
 - SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "USER_PASSWORD") (line 34)
 - SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ENV "USER_PASSWORD") (line 37)
✅ Image 'pxl_mlops_devbox' built successfully.

Note: It's possible to override the user information (name, UID, GID, password).
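
The USER_PASSWORD build argument is visible in the build warnings above; how the name, UID, and GID are overridden is defined in 001_build_image.bash and the Dockerfile, so the invocation below is only a hypothetical illustration.

$ # Hypothetical: pass overrides as Docker build arguments; check 001_build_image.bash / the Dockerfile for the real argument names.
$ docker build -t pxl_mlops_devbox --build-arg USER_PASSWORD='change_me' infrastructure/devbox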

4. Start the devbox

$ ./002_start_devbox.bash

This mounts your folders into the container:

Dir in repository    Dir in container    Extra information
commands/            /commands           Automatically added to PATH inside the container
config/              /config
data/                /data
scripts/             /scripts            Automatically added to PATH inside the container

This bash script can easily be extended with extra volume mounts if needed, as sketched below.
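
The actual volume flags live in 002_start_devbox.bash; adding another bind mount is typically one extra -v flag on its docker run command. The sketch below is illustrative only (the script's real flags, paths, and options may differ), with a hypothetical /models mount added at the end:

$ docker run -it --rm \
      -v "$(pwd)/../../commands:/commands" \
      -v "$(pwd)/../../config:/config" \
      -v "$(pwd)/../../data:/data" \
      -v "$(pwd)/../../scripts:/scripts" \
      -v "$(pwd)/../../models:/models" \
      pxl_mlops_devbox:latest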

5. Get your Kaggle API key

Log in to kaggle.com, open your account settings, and create a new API token; Kaggle will download a kaggle.json file containing your username and key.

6. Place the API key into the correct folder

Put the kaggle.json file in the config/kaggle directory on your host machine; it will automatically be available inside the container.
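
kaggle.json is a small JSON file containing your Kaggle username and API key. Assuming your browser saved it to ~/Downloads, placing it could look like this (the content shown is a placeholder; never commit this file):

$ mkdir -p config/kaggle
$ mv ~/Downloads/kaggle.json config/kaggle/kaggle.json
$ chmod 600 config/kaggle/kaggle.json        # keep the credentials private
$ cat config/kaggle/kaggle.json
{"username":"<your-kaggle-username>","key":"<your-api-key>"}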

7. Accept the Kaggle dataset terms (if needed!)

In this example we are going to use the Favorita Grocery Sales Forecasting dataset. Therefore, we need to accept the terms of this dataset.

Visit the competition page, click "Join Competition", and follow the necessary steps: https://www.kaggle.com/competitions/favorita-grocery-sales-forecasting

8. Download the Favorita dataset

Inside the devbox:

$ run_kaggle_download_script /scripts/download_favorita.py

This will download the dataset (if kaggle.json is configured and the terms are accepted) and extract it into /data.

9. Explore the data

The data will be located in:

data/raw_data/favorita-grocery-sales-forecasting/   # /data/raw_data/favorita-grocery-sales-forecasting/ inside the devbox

You can explore the data using:

  • Your own Python scripts (place them in /scripts)
  • Or VisiData, an excellent open-source, terminal-based data multitool (a quick shell-only peek is also sketched after this list).

    For example:

    $ vd /data/raw_data/favorita-grocery-sales-forecasting/train.csv
        

    Inspect all files.

    Pro tip: Keep an exploration log in Markdown to stay organized and avoid information overload.
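
For a quick first look straight from the shell (before or instead of VisiData), standard tools are enough; the paths below assume the Favorita download from step 8, seen from inside the devbox:

$ cd /data/raw_data/favorita-grocery-sales-forecasting
$ ls -lh                  # which files exist and how large they are
$ head -n 5 train.csv     # column names plus a few sample rows
$ wc -l train.csv         # row count (slow on a file this size)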

10. Read the project assignment

Consult the retake project assignment brief of the MLOps and/or Big Data course.

11. Check out the weekly train split script for Favorita

$ /scripts/split_favorita_train_in_weeks.py 
❗ No valid option provided. Use one of:
   --overview                         Show dataset summary
   --all                              Split full dataset by week
   --from DATE --to DATE              Split only specific date range
   --year YYYY --weeks N              Split N weeks from ISO Week 1
   --year YYYY --start-week W --weeks N  Start from ISO Week W

The train.csv file is quite large, so splitting it into smaller weekly files keeps processing manageable and gives you natural time-based slices for MLOps or Big Data work (for example, simulating weekly data arrivals).

$ /scripts/split_favorita_train_in_weeks.py --overview
Scanning dataset for date overview...

📊 Dataset Overview:
- Oldest date : 2013-01-01
- Newest date : 2017-08-15
- Total days  : 1688
- Total weeks : 241
- Total years : 4.62

This tool allows you to split the train.csv file into weekly chunks.

12. Split the Favorita data as needed

Examples:

  • Split the entire dataset (this will take a long time):
    $ /scripts/split_favorita_train_in_weeks.py --all
    ...
        

    The output is too verbose to include in this guide.

  • Split a specific year and number of weeks:
    $ /scripts/split_favorita_train_in_weeks.py --year 2016 --start-week 10 --weeks 5
    🗓️  Splitting 5 week(s) starting from Week 10, 2016
    From 2016-03-07 to 2016-04-10
    📦 Splitting data from 2016-03-07 to 2016-04-10
    /scripts/split_favorita_train_in_weeks.py:49: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
      for chunk in pd.read_csv(INPUT_FILE, parse_dates=["date"], chunksize=CHUNK_SIZE):
    📁 Writing weekly files to: /data/raw_data/favorita-grocery-sales-forecasting/weeks
    ✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W10.csv — 662413 rows
    ✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W11.csv — 665398 rows
    ✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W12.csv — 657875 rows
    ✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W13.csv — 681864 rows
    ✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W14.csv — 674518 rows
        

13. Do your project work

Use the weekly datasets, train models, explore drift, build pipelines: whatever your assignment requires.

14. Iterate

As your project evolves, keep refining your work:

  • Revisit step 10 regularly to stay aligned with the project requirements.
  • Repeat step 12 with new split configurations.
  • Revisit steps 9–11 to explore new slices of data or new experiments.
  • Continue step 13 until your project(s) are completed.

πŸ“ infrastructure/

Use this directory to implement the requested architecture with Docker Compose and any related tooling, using the devbox as inspiration. Leverage Docker volumes for persistent storage and for sharing data between containers where needed. You can also add sub-directories under commands/, config/, scripts/, ... and mount those as volumes to keep container-specific scripts separated.
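
As a minimal, hypothetical starting point (the stack directory, service name, and named volume below are placeholders, and whether the devbox image runs unattended under Compose depends on its entrypoint), a Compose-based setup under infrastructure/ could look like this:

$ mkdir -p infrastructure/my_stack && cd infrastructure/my_stack
$ cat > docker-compose.yml <<'EOF'
services:
  devbox:
    image: pxl_mlops_devbox:latest    # reuses the image built in step 3
    tty: true                         # keep the interactive devbox container alive
    volumes:
      - ../../data:/data              # same persistent data the devbox scripts expect
      - shared_artifacts:/artifacts   # hypothetical named volume shared with future services
volumes:
  shared_artifacts:
EOF
$ docker compose up -d

Additional services (databases, experiment trackers, schedulers, ...) can then be added to the same file and share data through bind mounts or the named volume.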

πŸ“ scripts/

Add additional scripts to this directory. It’s recommended to organize them into subdirectories. You may also create top-level folders like src/ if your project requires it.

πŸ“ documentation/

Put all documentation in this directory.

📌 License / Contribution

Feel free to fork, modify, or reuse this layout. Contributions or suggestions are welcome.
