This repository provides a reusable and modular foundation for MLOps and Big Data projects.
It is designed to accelerate dataset acquisition and experimentation, and to provide a starting point for MLOps pipeline development and Big Data infrastructure setups.
It includes:
- A containerised (Docker-based) Ubuntu 24.04 LTS "devbox", equipped with essential tools (e.g., Python, VisiData, …) and easily extensible via pip3 install and requirements.txt, with a clean Dockerfile (see the sketch after this list)
- Ready to download and split the necessary dataset(s): includes scripts to retrieve datasets from Kaggle and store them in a persistent volume
- Out-of-the-box access to third-party credential configuration files (e.g., kaggle.json in ./config/kaggle)
- Utilities for chunking or splitting time-series data (e.g., weekly splits, as used for Favorita)
- Clearly defined folders for persistent data, configuration, and scripts
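For example, to make a new Python package available in the devbox, append it to the devbox's requirements.txt and rebuild the image (a minimal sketch, assuming you start from the repository root; the package name is just an illustration):

$ echo "polars" >> infrastructure/devbox/requirements.txt
$ cd infrastructure/devbox
$ ./001_build_image.bash   # rebuilds the image; the pip install layer picks up the new package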
Follow these steps to get started, download example data (e.g., the Favorita dataset), and begin your project workflow.
Run the following commands in a Unix-like environment (Linux, macOS, or WSL). Note: Native Windows terminals are not recommended due to compatibility issues.
$ git clone git@github.com:PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode.git
# Or: gh repo clone PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode
# Or: git clone https://github.com/PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode.git
# Or: go to https://github.com/PXLAIRobotics/MLOps-RetakeProject-2425-StartingCode and download the ZIP and extract it.
$ cd MLOps-RetakeProject-2425-StartingCode
Read the structure overview depicted below to understand the purpose of:
- commands/
- config/ (and its subdirectory config/kaggle/)
- scripts/
- infrastructure/devbox/
- data/ (and its subdirectory raw_data/)
🗂️ Repository Structure
.
├── commands/          # CLI scripts and wrappers
├── config/            # Secrets/configs (excluded from version control)
│   └── kaggle/        # Place to put your local kaggle.json
├── data/              # Local data
│   └── raw_data/      # Raw data output (after download/extraction/splitting)
├── documentation/     # A place to put documentation
├── infrastructure/    # Put all infrastructure code here.
│   └── devbox/        # Self-contained Docker-based devbox environment
├── scripts/           # Python utilities for data handling
└── README.org         # This readme.
This sets up a containerised Python environment with all dependencies:
$ cd infrastructure/devbox
$ ./001_build_image.bash
🔧 Building Docker image: pxl_mlops_devbox
👤 User: user (UID: 1000, GID: 1000)
[+] Building 37.8s (25/25) FINISHED docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 3.54kB 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:24.04 1.9s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [ 1/20] FROM docker.io/library/ubuntu:24.04@sha256:440dcf6a5640b2ae5c77724e68787a906afb8ddee98bf86db94eea8528c2c076 1.4s
=> => resolve docker.io/library/ubuntu:24.04@sha256:440dcf6a5640b2ae5c77724e68787a906afb8ddee98bf86db94eea8528c2c076 0.0s
=> => sha256:3eff7d219313fd6db206bd90410da1ca5af1ba3e5b71b552381cea789c4c6713 28.86MB / 28.86MB 1.0s
=> => extracting sha256:3eff7d219313fd6db206bd90410da1ca5af1ba3e5b71b552381cea789c4c6713 0.4s
=> [internal] load build context 0.0s
=> => transferring context: 2.32kB 0.0s
=> [ 2/20] RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections 0.2s
=> [ 3/20] RUN apt-get update && apt-get --with-new-pkgs upgrade -y && apt-get install -y sudo 5.9s
=> [ 4/20] RUN rm -f /etc/update-motd.d/10-help-text 0.1s
=> [ 5/20] COPY ./create_user.bash /usr/local/bin/ 0.0s
=> [ 6/20] RUN chmod +x /usr/local/bin/create_user.bash 0.1s
=> [ 7/20] RUN /usr/local/bin/create_user.bash 0.1s
=> [ 8/20] RUN unset USER_PASSWORD 0.1s
=> [ 9/20] RUN set -a && . /env_vars && set +a && apt-get install -y --no-install-recommends python3 python3-venv && python3 -m venv /opt/venv && /opt/venv/bin/pip install --upgrade pip setuptools && chown -R "user:$GROUPNAME" /opt/venv && echo 'export PATH="/opt/venv/bin:$PATH"' >> /home/ 6.8s
=> [10/20] COPY requirements.txt /opt/requirements.txt 0.0s
=> [11/20] RUN set -a && . /env_vars && set +a && /opt/venv/bin/pip install --no-cache-dir -r /opt/requirements.txt && chown -R user:$GROUPNAME /opt/venv 8.9s
=> [12/20] RUN apt-get update && apt-get install -y --no-install-recommends git tmux htop vim gosu && rm -rf /var/lib/apt/lists/* 4.9s
=> [13/20] RUN set -a && . /env_vars && set +a && mkdir -p /home/user/.tmux/plugins/tpm && git clone https://github.com/tmux-plugins/tpm /home/user/.tmux/plugins/tpm && git clone https://github.com/jimeh/tmux-themepack.git /home/user/.tmux-themepack 1.4s
=> [14/20] RUN set -a && . /env_vars && set +a && echo "PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\W\[\033[00m\]\$ '" >> /home/user/.bashrc 0.1s
=> [15/20] COPY .tmux.conf /tmp/.tmux.conf 0.0s
=> [16/20] RUN set -a && . /env_vars && set +a && cp /tmp/.tmux.conf /home/user/.tmux.conf && chown user:$GROUPNAME /home/user/.tmux.conf 0.1s
=> [17/20] RUN set -a && . /env_vars && set +a && mkdir /data /commands /scripts /home/user/bin && echo 'source "$HOME/.bashrc"' >> /home/user/.bash_profile && echo "alias ll='ls --color=auto -alFh'" >> /home/user/.bashrc && echo "LS_COLORS=$LS_COLORS:'di=1;33:ln=36'" >> /home/user/.bashrc && 0.1s
=> [18/20] RUN set -a && . /env_vars && set +a 0.1s
=> [19/20] COPY ./entrypoint.bash /usr/local/bin/dev_entry 0.1s
=> [20/20] RUN chmod +x /usr/local/bin/dev_entry 0.1s
=> exporting to image 5.1s
=> => exporting layers 3.4s
=> => exporting manifest sha256:babdd44a8d7509bb6dd3b388582ddefe076f1332497714b9dddba6acd78fccee 0.0s
=> => exporting config sha256:ea552bf1845d8fe0a6684983ace3e164b7ca1a029a59aacb1f32b879aefd6d44 0.0s
=> => exporting attestation manifest sha256:b8768fcbf700636f138e1db7d82bcebf4544091d6480ce30c817a376e9d273eb 0.0s
=> => exporting manifest list sha256:376331f9312194645fa6b32ea7d79a6ab0bd2dfcff987ab950a45385c174ebb5 0.0s
=> => naming to docker.io/library/pxl_mlops_devbox:latest 0.0s
=> => unpacking to docker.io/library/pxl_mlops_devbox:latest 1.7s
2 warnings found (use docker --debug to expand):
- SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "USER_PASSWORD") (line 34)
- SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ENV "USER_PASSWORD") (line 37)
✅ Image 'pxl_mlops_devbox' built successfully.
Note: It's possible to override the user information (name, UID, GID, password).
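For example, something along these lines (a hypothetical invocation; the exact variable or argument names are defined in 001_build_image.bash, so check the script first):

$ USERNAME=alice USER_UID=1001 USER_GID=1001 ./001_build_image.bash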
$ ./002_start_devbox.bash
This mounts your folders into the container:
| DIR in repository | DIR in container | Extra information |
|---|---|---|
| commands/ | /commands | Automatically added to PATH |
| config/ | /config | |
| data/ | /data | |
| scripts/ | /scripts | Automatically added to PATH |

This .bash script can easily be extended with extra volumes if needed.
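For instance, an extra bind mount could be added next to the existing ones in the docker run call (an illustrative fragment; the surrounding options are abbreviated and the models/ directory is just an example):

docker run ... \
    --volume "$(pwd)/../../models:/models" \
    ...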
- Go to: https://www.kaggle.com/account
- Click "Create New API Token" under the API section
- This downloads a file: kaggle.json

Put the kaggle.json file in the config/kaggle directory on your host machine; it will automatically be available in the container.
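For example (assuming the token landed in your downloads folder), move it into place and restrict its permissions, since the Kaggle client warns about world-readable tokens:

$ mkdir -p config/kaggle
$ mv ~/Downloads/kaggle.json config/kaggle/
$ chmod 600 config/kaggle/kaggle.json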
In this example we are going to use the Favorita Grocery Sales Forecasting dataset, so we first need to accept its terms.
Visit the dataset page, click "Join Competition", and follow the necessary steps: https://www.kaggle.com/competitions/favorita-grocery-sales-forecasting
Inside the devbox:
$ run_kaggle_download_script /scripts/download_favorita.py
This will download the dataset (if kaggle.json is configured and the terms are accepted) and extract it into /data.
The data will be located in:
data/raw_data/favorita-grocery-sales-forecasting/
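The download step above is roughly equivalent to calling the official Kaggle CLI by hand (a sketch, assuming the kaggle client is installed in the devbox; the helper script additionally handles extraction):

$ kaggle competitions download -c favorita-grocery-sales-forecasting -p /data/raw_data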
You can explore the data using:
- Your own Python scripts (place them in /scripts)
- The excellent terminal-based tool VisiData, an open-source data multitool
For example:
$ vd /data/raw_data/favorita-grocery-sales-forecasting/train.csv
Inspect all files.
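If you want a quick first glance with plain coreutils before reaching for VisiData:

$ head -n 3 /data/raw_data/favorita-grocery-sales-forecasting/train.csv
$ wc -l /data/raw_data/favorita-grocery-sales-forecasting/*.csv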
Pro tip: Keep an exploration log in Markdown to stay organized and avoid information overload.
Consult the retake project assignment brief of the MLOps and/or Big Data course.
$ /scripts/split_favorita_train_in_weeks.py
❌ No valid option provided. Use one of:
--overview Show dataset summary
--all Split full dataset by week
--from DATE --to DATE Split only specific date range
--year YYYY --weeks N Split N weeks from ISO Week 1
--year YYYY --start-week W --weeks N Start from ISO Week W
The train.csv file is quite large, so splitting it into smaller weekly files may improve performance and enable meaningful MLOps or Big Data operations.
$ /scripts/split_favorita_train_in_weeks.py --overview
Scanning dataset for date overview...
📊 Dataset Overview:
- Oldest date : 2013-01-01
- Newest date : 2017-08-15
- Total days : 1688
- Total weeks : 241
- Total years : 4.62
This tool allows you to split the train.csv file into weekly chunks.
Examples:
- Split the entire dataset (this will take a long time):
$ /scripts/split_favorita_train_in_weeks.py --all ...
The output is too verbose to include in this guide.
- Split a specific year and number of weeks:
$ /scripts/split_favorita_train_in_weeks.py --year 2016 --start-week 10 --weeks 5
🗓️ Splitting 5 week(s) starting from Week 10, 2016
From 2016-03-07 to 2016-04-10
📦 Splitting data from 2016-03-07 to 2016-04-10
/scripts/split_favorita_train_in_weeks.py:49: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(INPUT_FILE, parse_dates=["date"], chunksize=CHUNK_SIZE):
📁 Writing weekly files to: /data/raw_data/favorita-grocery-sales-forecasting/weeks
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W10.csv → 662413 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W11.csv → 665398 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W12.csv → 657875 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W13.csv → 681864 rows
✅ Saved /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W14.csv → 674518 rows
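A quick sanity check of the generated slices:

$ ls -lh /data/raw_data/favorita-grocery-sales-forecasting/weeks/
$ wc -l /data/raw_data/favorita-grocery-sales-forecasting/weeks/train_2016-W1*.csv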
Use the weekly datasets, train models, explore drift, build pipelines: whatever your assignment requires.
As your project evolves, keep refining your work:
- Revisit step 10 regularly to stay aligned with the project requirements.
- Repeat step 12 with new split configurations.
- Revisit steps 9–11 to explore new slices of data or new experiments.
- Continue step 13 until your project(s) are completed.
Use this directory to implement the requested architecture using Docker Compose and all related and necessary tools.
Use the devbox as inspiration, and leverage Docker volumes for persistent storage and shared data access between containers where needed.
You can also add subdirectories in commands/, config/, scripts/, etc., and mount those as volumes to segregate scripts for specific containers (see the sketch below).
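A minimal sketch of that idea with plain docker commands (volume, image, and path names are placeholders; for the assignment you would express the same wiring in a docker-compose.yml):

$ docker volume create shared_data
$ docker run -d --name ingest --volume shared_data:/data --volume "$(pwd)/scripts/ingest:/scripts" my_ingest_image
$ docker run -d --name trainer --volume shared_data:/data --volume "$(pwd)/scripts/train:/scripts" my_training_image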
Add additional scripts to this directory. It's recommended to organize them into subdirectories.
You may also create top-level folders like src/ if your project requires it.
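For example (the directory names are only suggestions):

$ mkdir -p scripts/eda scripts/preprocessing src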
Put all documentation in this directory.
Feel free to fork, modify, or reuse this layout. Contributions or suggestions are welcome.