
Multivariate Normative Modeling Kit (MNM-Kit)



General Information

Code Repository

The code is structured as follows:

  • The config directory contains configuration files required to run the project.
  • The data directory contains raw and processed data (not included in the repository due to privacy reasons).
  • The docker directory contains required files to run the project in a Docker container.
  • The docs directory contains documentation and literature.
  • The experiments directory contains input and output artifacts per experiment.
  • The logs directory contains log files.
  • The models directory contains saved models.
  • The output directory contains output files (e.g., predictions, visualizations, etc.).
  • The scripts directory contains various scripts.
  • The src directory contains the source code.
  • The tests directory contains unit tests.

Please find more information on installing, running, and testing the project in the sections below.


Requirements

This project requires the following dependencies to be installed:

  • Python >= 3.10
  • Poetry >= 2.1.3

All dependencies are listed in the pyproject.toml file.

Alternatively, Docker can be used to run the project. The Dockerfile is provided in the repository. This requires Docker to be installed on the host machine.


Setup

Directly on Host Machine

You can install the package locally using poetry:

poetry install --no-root

Using Docker

You can build the Docker image using the provided Dockerfile:

docker build -t mnmkit:latest -f ./docker/Dockerfile .

Legacy Setup

In environments where Poetry is not available and Docker is not an option, the project can be run using pip and virtualenv. The following steps set up the project:

  • Create a requirements.txt file using poetry from the pyproject.toml file:

    poetry export --with=dev --with=test --without-hashes --without-urls | awk '{ print $1 }' FS=';' > requirements.txt
  • Create a virtual environment using virtualenv:

    virtualenv venv && source venv/bin/activate
  • Install the dependencies using pip:

    pip install -r requirements.txt
  • Replace all poetry run python commands with python in all subsequent commands.


Usage

Entry Point

The entry point of the project is the main.py file. Run the project using poetry:

poetry run python src/main.py --config [CONFIG_PATH] --mode [MODE] [OPTIONS]
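
For example, to train a model with the default configuration (an illustrative invocation; the seed value is a placeholder):

poetry run python src/main.py --config config/config_default.yml --mode train --seed 42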

You can also run the project in a Docker container after building the image (see the Setup section above):

docker run --rm \
           -v ./data:/data \
           -v ./output:/output \
           -v ./logs:/logs \
           -v ./models:/models \
           -v ./config:/config \
           mnmkit:latest \
           --config [CONFIG_PATH] \
           --mode [MODE] \
           [OPTIONS]
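
For example, assuming the default configuration file is used (note that the paths refer to the mounted volumes inside the container):

docker run --rm -v ./data:/data -v ./output:/output -v ./logs:/logs -v ./models:/models -v ./config:/config mnmkit:latest --config /config/config_default.yml --mode train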

Arguments

The main.py file accepts the following main arguments:

  • --config: Path to the configuration file. Default: config/config_default.yml.
  • --mode: Mode to run the project in. Options: train, inference, validate, tune. Default: train.

The main.py file accepts the following optional arguments:

  • --checkpoint: Path to a checkpoint to load.
  • --checkpoint-interval: Interval to save checkpoints.
  • --data_dir: Directory to save data.
  • --debug: Debug mode flag.
  • --device: Device to run the project on. Options: cpu, cuda.
  • --log_dir: Directory to save logs.
  • --log_level: Log level. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL.
  • --models_dir: Directory to save models.
  • --num_workers: Number of workers for data loading.
  • --output_dir: Directory to save output.
  • --skip-preprocessing: Skip preprocessing pipeline flag.
  • --seed: Seed for reproducibility.
  • --verbose: Verbosity flag.

Command line arguments override settings in the provided configuration file.
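
For example, the following invocation keeps the settings from a custom configuration file (a placeholder name) but overrides the device and log level on the command line:

poetry run python src/main.py --config config/config_custom.yml --mode train --device cpu --log_level DEBUG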

Datasets

Datasets must be provided in the data directory.

This project supports the following data types:

  • Tabular data in the following formats: csv, rds.

  • Image data in the following formats: jpg, jpeg, png, bmp, tiff. Images should be placed in a directory with the same name as the dataset file (e.g., dataset/<image_name>.jpg). If a test split is provided, the test split directory must be named test (e.g., dataset/test/<image_name>.jpg).
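
An illustrative layout for an image dataset, assuming a tabular file named dataset.csv accompanies the images (all names below are placeholders):

data/
  dataset.csv
  dataset/
    image_001.jpg
    image_002.jpg
    test/
      image_101.jpg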

The preprocessing pipeline is automatically applied to the datasets. The configuration file contains settings for the preprocessing pipeline, such as split sizes, transformations, and shuffling. Processed datasets are saved in the data/processed directory.

The internal file format is hdf, but it can be changed to csv in the configuration file for debugging purposes (not recommended for large datasets). To view HDF files, use the hdfview tool (see here).
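
Processed files can also be inspected programmatically. A minimal sketch using pandas (the file name is a placeholder; the actual path depends on your dataset and configuration):

import pandas as pd

# Illustrative only: adjust the path (and, if needed, the HDF key) to your data.
df = pd.read_hdf("data/processed/dataset.h5")
print(df.head())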

To skip the preprocessing pipeline, use the --skip-preprocessing flag.

Configuration

The project configuration is defined in the config directory. The configuration file is in YAML format. The default configuration file is config/config_default.yml.

Important: Copy this file to create a custom configuration. The default configuration will be overwritten each time the application is run.

The configuration file contains the following sections:

  • dataset: Configuration for the dataset.
  • model: Configuration for the model.
  • train: Configuration for the training process.
  • inference: Configuration for the inference process.
  • validation: Configuration for the validation process.
  • meta: Meta information for the project.
  • general: General configuration for the project.
  • system: System configuration for the project.

The configuration is validated against the schema before a task is run. Missing values are filled with default values from the default configuration file.
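
A minimal sketch of the top-level layout of a configuration file (only the section names are taken from this project; the keys inside each section are omitted or illustrative):

# config_custom.yml (illustrative skeleton)
dataset: {}     # e.g., split sizes, transformations, shuffling
model: {}
train: {}
inference: {}
validation: {}
meta: {}
general: {}
system: {}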

Output Files

The application generates different types of artifacts during the execution of the project.

  • Logs: Logs are saved in the logs directory, unless a custom log directory is provided. Logs are saved in the .log format.
  • Models: Models are saved in the models directory, unless a custom models directory is provided. Models are represented by their state dictionary and are saved in the .pt format. Checkpoints are stored in the checkpoints subdirectory.
  • Output: Output files are saved in the output directory, unless a custom output directory is provided. Examples of output files are metrics, visualizations, and predictions.
  • Processed Data: Processed data is saved in the data/processed directory in the internal file format (hdf by default; see the Datasets section).

Output files are not tracked in the repository and can be safely deleted. Additionally, all input and output files are copied to a timestamped experiment folder to keep track of experiments and to enable reproducibility.
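
Since models are stored as state dictionaries, a saved .pt file can be inspected with PyTorch. A minimal sketch (the file name is a placeholder):

import torch

# Illustrative only: load a saved state dictionary and list its parameter shapes.
state_dict = torch.load("models/model.pt", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))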

Scripts

The scripts directory contains various scripts:

  • The data scripts can be used to generate artificial data for testing purposes.
  • The slurm scripts are used to run the project on the (Snellius) HPC.
  • The sync script can be used to synchronize the required files to the HPC (requires established SSH connection).

Contributing

Dependency Management

Poetry is used for dependency management. Docs for Poetry can be found here.

  • Add a new dependency using poetry:

    poetry add [DEPENDENCY]
  • To update a dependency, use the poetry command:

    poetry update [DEPENDENCY]
  • To remove a dependency, use the poetry command:

    poetry remove [DEPENDENCY]

If the pyproject.toml file is updated manually, make sure to run the following command to update the lock file and install the new dependencies:

poetry lock && poetry install

Pre-Commit Hooks

Pre-commit hooks are used to ensure consistent and validated code style and quality. Docs for Pre-Commit can be found here. Pre-commit hooks are defined in the .pre-commit-config.yaml file.

  • Install the pre-commit hooks using poetry:

    poetry run pre-commit install
  • Run the pre-commit hooks manually using poetry:

    poetry run pre-commit run --all-files

Note: Code quality can also be manually checked before committing changes using the code quality tools described below in the Code Quality Tools section.
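
A minimal sketch of what a .pre-commit-config.yaml can look like (the repositories and revisions below are illustrative and not necessarily this project's actual hook set):

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black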

Git Workflow

  • Create a new branch for a new feature or bug fix:

    git checkout -b feature/feature_name
  • Commit changes to the branch:

    git add .
    git commit -m "Commit message"
  • Push the branch to the remote repository:

    git push origin feature/feature_name
  • Create a pull request on GitHub and assign reviewers.

Code Style

  • The PEP8 standard is used as the style guide for Python code. Docs for PEP8 can be found here.
  • Code style is enforced using the pre-commit hooks and CI/CD pipelines. Manual checks are available using the code quality tools described below in the Code Quality Tools section.
  • Type annotations are used to enforce type checking. Docs for type annotations can be found here.

Code Quality

Code quality is enforced using the bundled code quality tools and unit tests.

Different levels of code quality checks are available:

  • Manual checks are available using the code quality tools described in the Code Quality Tools section below and the testing tools in the Testing section.
  • Pre-commit hooks are used to validate local changes before committing.
  • Remote CI/CD pipelines are used to ensure code quality and to run tests. The CI/CD pipeline is set up using GitHub Actions. The pipeline can be found in the .github/workflows directory.
  • Model validation is performed using the validate mode in the main.py file. This mode can be used to validate the model using a validation dataset.
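
For example, model validation could be run as follows (the checkpoint path is a placeholder):

poetry run python src/main.py --config config/config_default.yml --mode validate --checkpoint models/checkpoints/checkpoint.pt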

Code Quality Tools

Static Code Analysis

  • PyLint is used for static code analysis. Docs for PyLint can be found here. Settings for PyLint can be found in the .pylintrc file. Run PyLint using poetry:

    poetry run pylint ./src
  • Ruff can be used for additional static code analysis. Docs for Ruff can be found here. Settings for Ruff can be found in the .ruff file. Run Ruff using poetry:

    poetry run ruff check ./src

Static Type Checking

  • MyPy is used for static type checking. Docs for MyPy can be found here. Run MyPy using poetry:

    poetry run mypy ./src

Import Sorting

  • isort is used for import sorting. Docs for isort can be found here. Run isort using poetry:

    poetry run isort --diff ./src

    To apply the changes to the files, use without the --diff flag:

    poetry run isort ./src

Code Formatting

  • Black (PEP8 compliant) is used for code formatting. Docs for Black can be found here. Run Black using poetry:

    poetry run black --check --diff ./src

    To apply the changes to the files, use without the --check and --diff flags:

    poetry run black ./src
  • Ruff (more aggressive; use with caution!) can be used for additional code formatting. Docs for Ruff can be found here. Settings for Ruff can be found in the .ruff file. Run Ruff using poetry:

    poetry run ruff format --check --diff ./src

    To apply the changes to the files, use without the --check and --diff flags:

    poetry run ruff format ./src

Unused Code Cleaning

  • Autoflake can be used to remove unused imports and variables. Docs for Autoflake can be found here. Run Autoflake using poetry:

    poetry run autoflake --remove-all-unused-imports --remove-unused-variables --in-place --check -r .

    To apply the changes to the files, use without the --check flag:

    poetry run autoflake --remove-all-unused-imports --remove-unused-variables --in-place -r .

Docstring Formatting

  • pydocstringformatter is used to format docstrings. Docs for pydocstringformatter can be found here. Run pydocstringformatter using poetry:

    poetry run pydocstringformatter ./src

    To apply the changes to the files, use the --write flag:

    poetry run pydocstringformatter ./src --write

Upgrade Python Syntax (>=3.12 compatible)

  • PyUpgrade can be used to upgrade Python syntax. Docs for PyUpgrade can be found here. PyUpgrade expects explicit file arguments, so pass it the source files. Run PyUpgrade using poetry:

    poetry run pyupgrade --py312-plus $(find ./src -name '*.py')

Code Security

  • Bandit is used for code vulnerability and security checks. Docs for Bandit can be found here. Settings for Bandit can be found in the .bandit file. Run Bandit using poetry:

    poetry run bandit -r .

Code Metrics

  • Radon is used for code metrics. Docs for Radon can be found here. Various metrics can be calculated using Radon.

    • Cyclomatic Complexity: Measures the complexity of the code. Run Radon using poetry:

      poetry run radon cc ./src
    • Maintainability Index: Measures the maintainability of the code. Run Radon using poetry:

      poetry run radon mi ./src
    • Halstead Metrics: Measures the complexity of the code. Run Radon using poetry:

      poetry run radon hal ./src
    • Raw Metrics: Measures the raw metrics of the code. Run Radon using poetry:

      poetry run radon raw ./src
  • Xenon is used for automated code complexity checks. Xenon uses Radon under the hood. Docs for Xenon can be found here. Run Xenon using poetry:

    poetry run xenon --max-absolute B --max-modules B --max-average A ./src

    Meaning of the flags:

    • Max Absolute: Maximum absolute complexity.
    • Max Modules: Maximum complexity per module.
    • Max Average: Maximum average complexity.

Testing

This code base uses different levels of testing to ensure code quality and functionality.

  • Unit testing: Unit tests are used to test individual components of the code base. Unit tests are written using pytest. Docs for pytest can be found here. Run the unit tests using poetry:

    poetry run pytest
  • Code coverage reports: Code coverage reports are generated using pytest-cov (a wrapper around Coverage.py). Docs for pytest-cov can be found here. Run the coverage report using poetry:

    poetry run pytest --cov=src --cov-report=term --cov-report=html --cov-report=xml --cov-fail-under=80
  • Property-based testing: Property-based testing is used to test the code base against a wide range of scenarios. Property-based tests are written using hypothesis. Docs for Hypothesis can be found here. These tests are executed automatically together with the pytest unit tests. Hypothesis test statistics can be shown using the following command:

    poetry run pytest --hypothesis-show-statistics
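
A minimal, illustrative property-based test (a hypothetical example, not taken from this code base):

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs: list[int]) -> None:
    # Sorting an already-sorted list must leave it unchanged.
    assert sorted(sorted(xs)) == sorted(xs)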

Projects Using MNM-Kit

  • MSc Thesis: "VAE-based Multivariate Normative Modeling: An Investigation of Covariate Modeling Methods" by Remy Duijsens (View Project)

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contact

For questions, please contact me at

