Syntax enhancement aka DSL-2 #984

@pditommaso

This is a request for comments on the implementation of a modules feature for Nextflow.

This feature allows NF processes to be defined in the main script or in a separate library file, and then invoked one or more times like any other routine, passing the required input channels as arguments.

Process definition

The syntax for defining a process is nearly identical to the usual one. It only requires using processDef instead of process and omitting the from/into declarations. For example:

processDef index {
    tag "$transcriptome.simpleName"

    input:
    file transcriptome 

    output:
    file 'index' 

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

The semantics and supported features remain identical to those of the current process definition. See a complete example here.

Process invocation

Once a process is defined it can be invoked like any other function in the pipeline script. For example:

transcriptome = file(params.transcriptome)
index(transcriptome)

Since the index process defines an output channel, its return value can be assigned to a channel variable and used as usual, e.g.:

transcriptome = file(params.transcriptome)
index_ch = index(transcriptome)
index_ch.println()

If a process produces two or more output channels, the multiple assignment syntax can be used to get a reference to each of them.
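
A rough sketch of how this might look, assuming the draft accepts Groovy's standard multiple assignment and returns the channels in the order the outputs are declared. The process split_greeting and the channel names upper_ch and lower_ch are hypothetical and only serve as illustration:

```groovy
// Hypothetical process with two outputs, written in the draft processDef syntax
processDef split_greeting {
  input:
    val greeting
  output:
    val upper
    val lower
  script:
    upper = greeting.toUpperCase()
    lower = greeting.toLowerCase()
    "echo $greeting"
}

// Groovy multiple assignment: one variable per declared output.
// Assumption: channels come back in declaration order (upper, then lower).
(upper_ch, lower_ch) = split_greeting('Hello')

upper_ch.println()
lower_ch.println()
```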

Process composition

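The result of a process invocation can be passed directly to another process, as with any function call. In the example below, foo declares two outputs and bar declares two inputs, so the outputs of foo feed the inputs of bar: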

processDef foo {
  input: 
    val alpha
  output: 
    val delta
    val gamma
  script:
    delta = alpha
    gamma = 'world'
    "some_command_here"
}

processDef bar {
  input:
    val xx
    val yy 
  output:
    stdout()
  script:
    "another_command_here"        
}

bar(foo('Hello'))

Process chaining

Processes can also be invoked as custom operators. For example, a process foo taking one input channel can be invoked as:

ch_input1.foo()

and a process taking two input channels as:

ch_input1.foo(ch_input2)

This allows built-in operators and processes to be chained together, e.g.:

Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .into { read_pairs_ch; read_pairs2_ch }

index(transcriptome_file)
    .quant(read_pairs_ch)
    .mix(fastqc(read_pairs2_ch))
    .collect()
    .multiqc(multiqc_file)

See the complete script here.
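
The chain above can also be read as ordinary function calls. The sketch below rests on an assumption the draft does not state explicitly: that the operator form passes its receiver as the first argument of the process. The intermediate names index_ch, quant_ch and fastqc_ch are hypothetical.

```groovy
// Same pipeline as the chained version, written as plain function calls.
// Assumption: `a.proc(b)` is equivalent to `proc(a, b)`.
index_ch  = index(transcriptome_file)
quant_ch  = quant(index_ch, read_pairs_ch)
fastqc_ch = fastqc(read_pairs2_ch)

// Built-in operators (mix, collect) still apply to the returned channels
multiqc(quant_ch.mix(fastqc_ch).collect(), multiqc_file)
```

The chained form avoids naming the intermediate channels, while the function form makes each process's inputs explicit.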

Library file

A library is just an NF script containing one or more processDef declarations. The library can then be imported using the importLibrary statement, e.g.:

importLibrary 'path/to/script.nf'

Relative paths are resolved against the project baseDir variable.
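
As a minimal sketch of how the two pieces fit together, the snippet below shows a library file and a main script side by side. The path lib/salmon.nf is a hypothetical example; the index process is the one defined earlier in this proposal.

```groovy
// --- lib/salmon.nf (hypothetical library file) ---
processDef index {
    input:
    file transcriptome

    output:
    file 'index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

// --- main.nf ---
// The path is relative, so it would be resolved against baseDir
importLibrary 'lib/salmon.nf'

transcriptome = file(params.transcriptome)
index_ch = index(transcriptome)
index_ch.println()
```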

Test it

You can try the current implementation using version 19.0.0.modules-draft2-SNAPSHOT, e.g.:

NXF_VER=19.0.0.modules-draft2-SNAPSHOT nextflow run rnaseq-nf -r modules

Open points

  1. When a process is defined in a library file, should it be possible to access the params values? Currently it is possible, but I think this is not a good idea because it makes the library depend on the script params, which makes it very fragile.

  2. How should parameters be passed to a process defined in a library file, for example memory and cpus settings? It could be done using the config file as usual, but I expect there could be a need to parametrise the process definition and specify the parameters at invocation time.

  3. Should a namespace be used when defining processes in a library? What if two or more processes have the same name in different library files?

  4. One or many processes per library file? Currently any number of processes can be defined, but I'm starting to think it would be better to allow only one process per file. This would simplify reuse across different pipelines and import into tools such as dockstore, and it would make the dependencies of the pipeline more intelligible.

  5. Remote library files? I'm not sure it's a good idea to be able to import remotely hosted files, e.g. http://somewhere/script.nf. Remote paths tend to change over time.

  6. Should a version number be associated with the process definition? How should it be used or enforced?

  7. How to test process components? Ideally it should be possible to include the required container in the process definition and unit test each process independently.

  8. How to chain a process returning multiple channels?
