Refactor pipeline downloads command to use nextflow inspect for container detection #3634


Open · wants to merge 122 commits into base: dev

Conversation

@ErikDanielsson (Contributor) commented Jun 23, 2025

This is a draft PR for refactoring nf-core pipelines download for readability and to use the nextflow inspect command for container detection. It builds upon the excellent work of @MatthiasZepper, @muffato (#3509), @JulianFlesch (refactor/pipeline-download), and @nikhil (#3517).

This should close nf-core/modules#7461

Integrated changes from contributors:

  • Base container fetching primarily on nextflow inspect
  • Add SingularityFetcher class for fetching singularity images
  • Add DockerFetcher class for fetching docker images

Added changes

  • Refactor download.py
    • Split out the WorkflowRepo class. I have not made significant changes to the code (yet)
    • (Re)move regex parsing of container strings. This is still present but moved to download/utils.py. I will remove it entirely once we have tested the nextflow inspect command properly.
    • Add checks and nice reporting for required Nextflow version for nextflow inspect.
  • Refactor SingularityFetcher and DockerFetcher: I've created a superclass ContainerFetcher with a coherent interface and some code sharing.
  • Changed --parallel-downloads to --parallel to better reflect that both Docker and Singularity images are processed. See Docker concurrency below.
  • Replace regexes with nextflow inspect

Discussion points related to this

  • I've tested quite a few pipelines, and nextflow inspect works nicely overall. However, some older pipelines require that input parameters be specified, which we would then also have to do for nextflow inspect. I suspect that this change happened when the lib/ folder was removed; is this correct? It is in principle possible to pass (dummy or actual) input parameters with the -params-file flag, but I'm not sure whether that is desirable or necessary for current pipelines.
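
If passing dummy parameters does turn out to be necessary, the invocation could look roughly like this sketch (`build_inspect_command` is a hypothetical helper, not part of this PR):

```python
import json
from tempfile import NamedTemporaryFile


def build_inspect_command(pipeline_dir, profile, params=None):
    """Build a 'nextflow inspect' invocation, optionally writing (dummy or
    actual) input parameters to a JSON file passed via -params-file."""
    cmd = ["nextflow", "inspect", pipeline_dir, "-profile", profile]
    if params:
        tmp = NamedTemporaryFile("w", suffix=".json", delete=False)
        json.dump(params, tmp)
        tmp.close()
        cmd += ["-params-file", tmp.name]
    return cmd
```

The command list could then be handed to subprocess.run; keeping the builder pure makes it easy to unit-test without invoking Nextflow.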

Tests

The nextflow inspect command must be run on a full pipeline: by default it uses a Nextflow file as the entry point and then checks which modules are imported by any workflows or subworkflows that are used.
This means that our current tests, which are based on stripped modules, do not work. I have therefore added a pipeline skeleton to the test data in "mock_pipeline_containers" with the following features:

  • The container URIs tested by nextflow inspect are located in modules/local/ -- I have structured it so that it mirrors the typical structure of local modules in an nf-core pipeline. The tested URIs are taken from the containers present in mock_module_containers
    • The subdirectory passing contains modules where the container strings are correctly captured by nextflow inspect
  • Tests are performed by running nextflow inspect on the entrypoint main_failing_test.nf.
  • Correct behaviour is tested in tests/download/DownloadTest.test_containers_pipeline_<container>, where <container> is either singularity or docker. In these tests the output of nextflow inspect . -profile <container> is compared to the true list of containers, which is kept in mock_pipeline_containers/per_profile_output as a JSON file.
  • I have left much of the Nextflow code added by the nf-core create command -- we could definitely make this folder leaner if that is desired. Any pointers related to this are very welcome!

Once we are ready to remove the legacy regex code we will be able to remove the old test data folder for module container definitions mock_module_containers.
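
The comparison the tests perform could be sketched as follows. This assumes nextflow inspect emits JSON of the shape {"processes": [{"name": ..., "container": ...}]}; the helper name is hypothetical:

```python
import json


def containers_from_inspect(inspect_json):
    """Extract the sorted, unique container URIs from 'nextflow inspect'
    JSON output, skipping processes without a container."""
    data = json.loads(inspect_json)
    return sorted({p["container"] for p in data["processes"] if p.get("container")})
```

The test would then compare this list against the JSON file kept in mock_pipeline_containers/per_profile_output.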

Discussion points related to this:

  • Should we support container definitions in configs? While nextflow inspect will capture them if they are configured correctly, I am not sure whether this is desired behaviour.
    • Since we will likely make the container definitions stricter (Move to seqera containers 1 and Move to seqera containers 2), I presume it is, but then we need to decide on backwards compatibility.
    • If we want to support it, we need to port the tests to ensure that containers are captured correctly.

Downloading docker images

Support for Docker containers is added in this PR. It works slightly differently from the support for Singularity containers:

  • Docker images are always pulled from a registry with docker image pull
  • Images that have been pulled (or otherwise available locally) can be saved to a tar archive with docker image save to be transferred to the offline machine

Currently the Singularity options --container-cache-utilisation and --container-cache-index are not supported for Docker images, but the former might be in the future; see the comment below. Furthermore, since nextflow inspect reports an absolute URI for each container image, we do not support the --container-library option for Docker either. Integrating support for this is probably best done in conjunction with the move to Seqera containers.
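
The two-step pull-then-save flow described above can be sketched as a pure command builder (a sketch only; the function name is hypothetical, and the real DockerFetcher in this PR may differ):

```python
from pathlib import Path


def docker_fetch_commands(container, output_path):
    """Commands to pull a Docker image from a registry and save it to a tar
    archive that can be transferred to an offline machine."""
    return [
        ["docker", "image", "pull", container],
        ["docker", "image", "save", container, "-o", str(Path(output_path))],
    ]
```

Each command list could be executed in turn with subprocess.run, with the pull step parallelised across images.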

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

muffato and others added 30 commits March 25, 2025 14:54
@JulianFlesch (Contributor) left a comment

Nice work Erik! Thanks for your contribution 🚀

Sorry for taking this long 🙈

@@ -416,10 +416,10 @@ def command_pipelines_lint(
)
@click.option(
"-d",
"--parallel-downloads",
Contributor:

I agree with this. I wouldn't change it unless there is a good reason to.

@@ -416,10 +416,10 @@ def command_pipelines_lint(
)
@click.option(
"-d",
"--parallel-downloads",
Contributor:

Do as you see fit, I will also not fight anyone on this :)
FYI: the short flag -t, as in tasks/threads, is also still available.

@@ -416,10 +416,10 @@ def command_pipelines_lint(
)
@click.option(
"-d",
"--parallel-downloads",
Member:

joining late to the party, but I would vote for keeping the long --parallel-downloads, otherwise I don't see where the short -d comes from.
But also not a strong request, so happy to accept the change if this is the preferred option.

# Set the download URL and return - only applicable for classic downloads
self.wf_download_url = {
**self.wf_download_url,
revision: f"https://github.com/{self.pipeline}/archive/{wf_sha}.zip",
Member:

I would keep it to only support GitHub for now

raise DownloadError(e)

except KeyError as e:
log.warning("Failed to parse output of 'nextflow inspect' to extract containers")
Member:

same here


def pull_and_save_image(self, container: str, output_path: Path) -> None:
"""
Pull a docker image and then save it
Contributor Author:

Suggested change:
- Pull a docker image and then save it
+ Pull a Docker image and then save it

@ErikDanielsson (Contributor Author)

Docs updated in: nf-core/website#3463

@muffato (Member) commented Aug 16, 2025

Hello! That's an impressive PR, congrats!

I'm testing the feature on a few pipelines and noticed that there are some leftover files from running Nextflow:

$ env NXF_SINGULARITY_CACHEDIR='' nf-core pipelines download -r 1.11.0 -o fngs -x none -c yes -s singularity -d 4 nf-core/fetchngs
(...)
$ find .nextflow/ work/ null/
.nextflow/
.nextflow/cache
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/000003.log
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/LOCK
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/MANIFEST-000002
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/db/CURRENT
.nextflow/cache/db157365-7982-4394-a757-10579ac6e684/index.kickass_leibniz
.nextflow/plr
work/
null/
null/pipeline_info
null/pipeline_info/execution_timeline_2025-08-16_14-29-22.html
null/pipeline_info/pipeline_dag_2025-08-16_14-29-22.html
null/pipeline_info/execution_report_2025-08-16_14-29-22.html
null/pipeline_info/execution_trace_2025-08-16_14-29-22.txt

(null is the default value of outdir for this pipeline; for other pipelines it could be something else.)

Have you noticed that?
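
One possible mitigation, sketched below with hypothetical helpers (not code from this PR): run nextflow inspect from a disposable working directory so that .nextflow/, work/ and the default outdir land somewhere temporary, or clean up the known leftover paths afterwards.

```python
import subprocess
import tempfile


def leftover_paths(outdir="null"):
    """Paths a bare Nextflow run may leave behind in the working directory.
    'null' is the default outdir when the parameter is unset, as observed above."""
    return [".nextflow", "work", outdir]


def run_inspect_in_scratch(pipeline_dir, profile):
    """Run 'nextflow inspect' with a temporary cwd so the leftovers are
    created in a disposable location instead of the user's directory."""
    with tempfile.TemporaryDirectory() as scratch:
        result = subprocess.run(
            ["nextflow", "inspect", pipeline_dir, "-profile", profile],
            cwd=scratch,
            capture_output=True,
            text=True,
        )
        return result.stdout
```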


Also, images are created with permissions rw------- (and it's not a umask thing from my environment).

$ ls -l fngs/singularity-images/ | grep -v '^l'
total 523852
-rw------- 1 mm49 team328 154169344 Aug 16 14:32 mulled-v2-5f89fe0cd045cb1d615630b9261a1d17943a9b6a-2f4a4c900edd6801ff0068c2b3048b4459d119eb-0.img
-rw------- 1 mm49 team328  62779423 Aug 16 14:32 python-3.9--1.img
-rw------- 1 mm49 team328 137523200 Aug 16 14:32 sra-tools-2.11.0--pl5321ha49a11a_3.img
-rw------- 1 mm49 team328 154169344 Aug 16 14:32 sra-tools-3.0.8--h9f5acd7_0.img
-rw------- 1 mm49 team328  27750400 Aug 16 14:32 ubuntu-20.04.img
$ umask
0022
$ touch test
$ ls -l test
-rw-r--r-- 1 mm49 team328 0 Aug 16 14:34 test
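
A possible fix for the 0600 permissions, as a sketch (`respect_umask` is a hypothetical helper, not code from this PR): files written through a private temp file often end up mode 0600 regardless of umask, so the conventional mode can be re-applied afterwards.

```python
import os


def respect_umask(path):
    """Re-apply the process umask to a downloaded image file, restoring the
    conventional 0666 & ~umask mode (e.g. 0644 under umask 0022)."""
    umask = os.umask(0)   # read the current umask...
    os.umask(umask)       # ...and immediately restore it
    os.chmod(path, 0o666 & ~umask)
```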

Finally, to comment on #3634 (comment) and #3634 (comment), here are my observations from testing combinations of cache/library parameters and options.

  1. The proposed implementation/purpose of -u copy is to create a standalone copy of the pipeline. Thus, its local singularity-images/ directory must be complete, and nextflow.config must be modified to point at it. The pipeline can then run without a cache/library directory externally set.
    Rightly, the command takes images from the pre-existing cache or library if possible, rather than downloading absolutely everything.
    The current implementation also copies images that are downloaded to the cache directory. This has no runtime effect but honours the purpose of the cache/library: after the command is run, they together hold all images as well, thus helping future / other invocations of Nextflow. Note that the cache may not hold all images itself, as some may rather be in the library only.
  2. The proposed implementation/purpose of -u amend is to create a copy of the pipeline that entirely relies on the external cache alone. That's why the local singularity-images/ directory is left empty and all images are rather deposited into the cache. nextflow.config is not modified.
    Here also, the command only downloads the images that are not present in the cache/library.
    Contrary to -u copy, images are copied to the cache to make it complete on its own, rather than allowing some images to be in the library only.
  3. However, I don't understand the implementation/purpose of skipping the -u option. When I run nf-core pipelines download without -u, the local singularity-images/ directory is made complete but nextflow.config isn't updated, meaning singularity-images/ is not used. Fortunately, like -u copy, the cache is updated, so the pipeline can still run offline provided the cache (and the library) are set externally. It's essentially something in between -u copy and -u amend, but I can't define the rationale.
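
The three behaviours above can be summarised roughly as follows (my reading of the observations, not an authoritative spec; "unset" means -u was not given):

```python
# Per -u mode: is the local singularity-images/ directory made complete,
# is nextflow.config rewritten to point at it, and is the external cache updated.
CACHE_MODES = {
    "copy":  {"local_dir_complete": True,  "config_updated": True,  "cache_updated": True},
    "amend": {"local_dir_complete": False, "config_updated": False, "cache_updated": True},
    "unset": {"local_dir_complete": True,  "config_updated": False, "cache_updated": True},
}
```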

My thinking is that:

  1. In -u amend mode, no need to duplicate images from the library to the cache. In other words, assume that the pipeline will run with both the cache and library set the same way as in nf-core pipelines download, rather than only the cache.
  2. In -u amend mode, no need to create a local singularity-images/ directory if it's left empty.
  3. Don't allow -u to be unset?

Matthieu

@jpfeuffer (Contributor) commented Aug 22, 2025

Hi @ErikDanielsson, does this support authenticated downloads from a private GitHub repo now? It would be a very easy fix if, instead of curling the repo tar.gz via GitHub's API URL, one just used the GitHubRepo class that nf-core tools uses everywhere else.
I have a feeling this could be part of an nf-core download refactor as well. You can use my minimal PR: #3607

Ref. issue: #2406
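
For illustration, the bare authenticated request could be built like this (a sketch only; `archive_request` is hypothetical, and reusing the existing GitHubRepo class, as suggested above, is likely the better route):

```python
def archive_request(owner, repo, sha, token=""):
    """Build the URL and headers for an (optionally authenticated) download
    of a GitHub repo tarball via the REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/tarball/{sha}"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return url, headers
```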

Labels: download (nf-core download) · WIP (Work in progress)
Projects: Status: In Progress
8 participants