Nextflow is a workflow tool, frequently used in bioinformatics, for running complex, multi-stage pipelines. For more details on how to use Nextflow, see the online documentation at https://nextflow.io.
Loading Nextflow
Nextflow is installed as a module in /apps. To load the module:
module load nextflow/21.04.3
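To confirm that the module has loaded, you can ask Nextflow to print its version; the output will reflect whichever module version you loaded:

nextflow -version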
Running Nextflow on Gadi
In Nextflow, pipelines are defined as a series of tasks, along with a set of inputs and outputs for each task. Typically, each task is submitted as a separate job to the queue. This requires a long-running Nextflow process that can manage these tasks. The best way to run this is in its own separate batch queue job:
#!/bin/bash
#PBS -lwalltime=24:00:00,ncpus=1,mem=4G,wd
#PBS -lstorage=scratch/<abc>
#PBS -q normal
#PBS -P <abc>

module load nextflow/21.04.3

nextflow run hello
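The script can then be submitted to the queue with qsub in the usual way; the filename used here is just a placeholder:

qsub run_nextflow.pbs

From within this job, Nextflow submits each task in the pipeline as its own job to the queue.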
Specifying resources
The version of Nextflow installed on Gadi has been slightly modified to make it easier to specify resource options for jobs submitted to the queueing system. Within the nextflow.config file for your workflow:
- Use the pbspro executor;
- Extra flags have been added to specify project, storage and gpus as an alternative to the clusterOptions flag;
- The disk flag can be used to reserve space in /jobfs (see the example at the end of this section).
As an example, the process section of your config file might contain:
process {
    executor = 'pbspro'
    queue = 'normal'
    project = '<abc>'
    storage = 'scratch/<abc>+gdata/<abc>'

    withName: 'task1' {
        cpus = 2
        time = '1d'
        memory = '8GB'
    }
}
which is equivalent to:
process {
    executor = 'pbspro'
    clusterOptions = '-q normal -P <abc> -lstorage=scratch/<abc>+gdata/<abc>'

    withName: 'task1' {
        cpus = 2
        time = '1d'
        memory = '8GB'
    }
}
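The disk flag mentioned above can be set on a task in the same way to reserve space in /jobfs on the node running that task. A minimal sketch, where the task name and sizes are placeholders:

process {
    executor = 'pbspro'
    queue = 'normal'
    project = '<abc>'
    storage = 'scratch/<abc>'

    withName: 'task1' {
        cpus = 2
        memory = '8GB'
        disk = '20GB'    // reserves 20GB of /jobfs for this task
    }
}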