This repository follows a standard setup I use for data science projects, which includes:
- A research compendium layout, including a local Python package (see File Structure).
- Visual Studio Code (VSC) as the preferred IDE, with recommended extensions.
- A VS Code Dev Container, powered by Docker, as a reproducible development environment.
- pre-commit to manage git hooks.
- Python tooling:
  - Black for code formatting (pre-commit and VSC extension). In addition, I follow the Google style guide.
  - Ruff (pre-commit and VSC extension) and SonarLint (VSC extension) for linting.
  - mypy for type checking (VSC extension).
  - uv to compile requirements.
  - pdoc to generate API documentation (including a pre-commit hook that generates local documentation). Python docstrings are written following the Google docstring format, with the help of the autoDocstring VSC extension.
  - Automatic versioning of the local package from git tags via setuptools_scm, following semantic versioning.
- SQLFluff as a linter and formatter for SQL files (pre-commit and VSC extension).
- prettier (VSC extension) as a formatter for YAML, JSON and Markdown files.
- Taplo (VSC extension) as a formatter for TOML files.
- A Makefile to provide an interface to common tasks (see Make commands).
- Conventional commit messages (enforced by pre-commit). A sample hook configuration is sketched after this list.
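As an illustration, the hooks mentioned above could be wired together in `.pre-commit-config.yaml` roughly as follows (a hedged sketch: the exact hooks, repositories and `rev` pins used by the template may differ):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2 # placeholder revision
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4 # placeholder revision
    hooks:
      - id: ruff
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.0.7 # placeholder revision
    hooks:
      - id: sqlfluff-lint
  - repo: https://github.com/compilerla/conventional-pre-commit
    rev: v3.2.0 # placeholder revision
    hooks:
      - id: conventional-pre-commit
        stages: [commit-msg]
```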
```
.
├── analysis/                 # Analysis scripts and notebooks
├── data/                     # Data files (usually git ignored)
├── docs/                     # API documentation (git ignored)
├── results/                  # Output files: figures, tables, etc. (git ignored)
├── scripts/                  # Utility scripts (e.g. env setup)
├── src/                      # Local Python package
│   ├── __init__.py
│   └── config.py             # Configs, constants, settings
├── tests/                    # Unit tests for src/
│   └── test_*.py
├── .devcontainer/            # VS Code Dev container setup
├── .vscode/                  # VS Code settings and extensions
├── Dockerfile                # Dockerfile used for dev container
├── Makefile                  # Utility commands (docs, env, versioning)
├── pyproject.toml            # Configs for package, tools (Ruff, mypy, etc.) and direct deps
├── requirements.txt          # Pinned dependencies (generated)
├── taplo.toml                # Configs for TOML formatter
├── .pre-commit-config.yaml   # Configs for pre-commit
└── .sqlfluff                 # Configs for SQLFluff
```
The preferred development environment for this project is a VS Code Dev Container, which provides a consistent and reproducible setup using Docker.
- Install and launch Docker.
- Install and open the project in VS Code.
- Open the container by using the command palette in VS Code (`Ctrl + Shift + P`) to search for "Dev Containers: Open Folder in Container...".
Once inside the container, the dependencies specified in `requirements.txt` are installed and the local package is available in editable mode.
If needed, the container can be rebuilt by searching for "Dev Containers: Rebuild Container...".
For more details regarding Dev Containers, or alternative environment setups (venv, conda, etc.), please refer to DEVELOPMENT.md.
Regardless of the environment, install Git hooks after setup with `pre-commit install` to ensure the code is automatically linted and formatted on commit.
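For example (assuming pre-commit is available in the environment):

```bash
pre-commit install          # register the git hooks defined in .pre-commit-config.yaml
pre-commit run --all-files  # optionally, run all hooks once on the whole repository
```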
Requirements are managed with:
- `pyproject.toml`: lists the direct dependencies of the `src` package and the development dependencies (e.g. for the analysis).
- `requirements.txt`: pins all dependencies (direct and indirect). This file is automatically generated with `uv` and is used to fully recreate the environment.

The local package (`src`) is not included in `requirements.txt`, so installation is a two-step process.
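For illustration, the dependency section of `pyproject.toml` might look roughly like this (hypothetical package and dependency names; the template's actual contents will differ):

```toml
[project]
name = "src"
dynamic = ["version"]   # version derived from git tags via setuptools_scm
dependencies = [
    "pandas",           # direct dependencies of the src package
]

[project.optional-dependencies]
dev = [
    "jupyter",          # development / analysis dependencies
    "pre-commit",
]
```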
- Initial setup or adding new direct dependencies:
  - Add dependencies to `pyproject.toml`.
  - Run `make reqs` to compile `requirements.txt`.
- Upgrading packages: compile new requirements with `uv pip compile pyproject.toml -o requirements.txt --all-extras --upgrade`, then run `make deps`.

Finally, run `make deps` to install the pinned dependencies and the local package in editable mode.
Common utility commands are available via the Makefile:
- `make reqs`: Compile `requirements.txt` from `pyproject.toml`.
- `make deps`: Install requirements and the local package.
- `make docs`: Generate the package documentation.
- `make tag`: Create and push a new Git tag by incrementing the version.
- `make venv`: Set up a venv environment (see DEVELOPMENT.md).
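A rough sketch of what some of these targets might expand to (the recipes in the template's actual Makefile may differ; recipes must be indented with tabs):

```makefile
.PHONY: reqs deps docs

reqs:  # compile requirements.txt from pyproject.toml
	uv pip compile pyproject.toml -o requirements.txt --all-extras

deps:  # install pinned dependencies and the local package in editable mode
	pip install -r requirements.txt
	pip install -e . --no-deps

docs:  # generate API documentation with pdoc
	pdoc src -o docs
```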
Delete this section after initialising a project from the template.
This template aims to be relatively lightweight and tailored to my needs. It is therefore opinionated and in constant evolution, reflecting my data science journey with Python. It is also worth noting that this template is focused more on experimentation than on sharing a single final product.
- Initialise your GitHub repository with this template. Alternatively, fork (or copy the content of) this repository.
- Update:
  - the project metadata in `pyproject.toml`, such as the description and the authors.
  - the repository name (if the template was forked).
  - the README (title, badges, sections).
  - the license.
- Set up your preferred development environment (see Development Environment).
- Specify, compile and install your requirements (see Managing requirements).
- Adjust the configurations to your needs (e.g. the Python configuration in `src/config.py`, the SQL dialect in `.sqlfluff`, etc.).
- Add a git tag for the initial version with `git tag -a v0.1.0 -m "Initial setup"`, and push it with `git push origin --tags`. Alternatively, use `make tag`.
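Once a tag exists, setuptools_scm derives the package version from it; a quick way to check (run from the project root, inside the environment):

```bash
python -m setuptools_scm   # prints e.g. 0.1.0, or a dev version such as 0.1.1.dev3+g1a2b3c4 after new commits
```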
The `src/` package could contain the following modules or sub-packages, depending on the project:
- `utils` for utility functions.
- `data_processing`, `data` or `datasets` for data processing functions.
- `features` for extracting features.
- `models` for defining models.
- `evaluation` for evaluating performance.
- `plots` for plotting functions.
The repository structure could be extended with:
- `models/` to store model files.
- subfolders in `data/`, such as `data/raw/` for storing raw data.
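Project paths such as `data/` and `results/` are typically exposed as constants in `src/config.py`. A minimal sketch (only `RES_DIR` is referenced later in this README; the other names are illustrative):

```python
"""Configs, constants and settings for the project (illustrative sketch)."""
from pathlib import Path

ROOT_DIR = Path(__file__).resolve().parents[1]  # repository root
DATA_DIR = ROOT_DIR / "data"
RES_DIR = ROOT_DIR / "results"
```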
MLflow can be used as a tool to track Machine Learning experiments.
Often, MLflow will be configured so that the results are saved to a remote database and artifact store.
If this is not the case, the following can be added in `src/config.py` to set up local storage for MLflow experiments:

```python
import os  # if not already imported in config.py

MLRUNS_DIR = RES_DIR / "mlruns"  # RES_DIR points to the results directory (defined in config.py)
TRACKING_URI = "file:///" + MLRUNS_DIR.as_posix()
os.environ["MLFLOW_TRACKING_URI"] = TRACKING_URI
```
Then, the MLflow UI can be launched with:

```bash
mlflow ui --backend-store-uri file:///path/to/results/mlruns
```
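With the tracking URI pointing at the local store, runs can then be logged as usual (the experiment, parameter and metric names below are made up):

```python
import mlflow

mlflow.set_experiment("example-experiment")
with mlflow.start_run():
    mlflow.log_param("alpha", 0.1)
    mlflow.log_metric("rmse", 0.42)
```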
Configurations, such as credentials, can be loaded from a `.env` file.
This can be achieved by mounting a `.env` file directly in the Dev Container, updating the `runArgs` option in `.devcontainer/devcontainer.json` accordingly.
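For instance, the relevant part of `.devcontainer/devcontainer.json` could look like this (an assumption about how the mounting is done; other options are omitted, and `${localWorkspaceFolder}` is a standard Dev Container variable):

```json
{
  "runArgs": ["--env-file", "${localWorkspaceFolder}/.env"]
}
```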
Alternatively, one can use the python-dotenv package and add the following in `src/config.py`:

```python
from dotenv import load_dotenv

load_dotenv()
```
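Variables defined in `.env` are then available through the environment (the variable name below is hypothetical):

```python
import os

api_key = os.getenv("API_KEY")
```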
Full project documentation (beyond the API) could be generated using mkdocs or quartodoc.
This template is not tied to a specific platform and does not include continuous integration workflows. Nevertheless, the template could be extended with the following integrations:
- GitHub's Dependabot for dependency updates (a sample configuration is sketched below), or pip-audit.
- Testing and code coverage.
- Building and hosting documentation.
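For example, Dependabot only requires a small configuration file (a minimal sketch; the ecosystems and schedule would be adapted to the project):

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
```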
This template is inspired by the concept of a research compendium, by similar templates I created for R projects (e.g. reproducible-workflow), and by other, more exhaustive, templates such as: