Usage

To lookup all versions of rapids available in dk92, run

module use /g/data/dk92/apps/Modules/modulefiles/
module avail rapids

To list all python libraries and versions available in a specific module version, say rapids/22.02, run

login-node $ module use /g/data/dk92/apps/Modules/modulefiles/
login-node $ module load rapids/22.02
login-node $ conda list

If you have any specific library in mind, for example, nltk, run the following to look up the version

$ conda list | grep "nltk"
nltk                      3.6.7              pyhd8ed1ab_0    conda-forge

In the return message, the first column prints out the name of the library, followed by the version in the second column.

To test any packages in the module on Gadi, first submit an interactive job to the gpuvolta queue and then load the specific module version of interest once the job starts:

login-node $ qsub -I -P${PROJECT} -qgpuvolta -lwalltime=00:30:00,ncpus=12,ngpus=1,mem=90GB,jobfs=100GB,storage=gdata/dk92+gdata/${PROJECT},other=hyperthread,wd

#wait for the interactive job to start
gpu-node $ module use /g/data/dk92/apps/Modules/modulefiles/
gpu-node $ module load rapids/22.02
gpu-node $ python3
python3 >>> import cudf
...

Note that the gpuvolta queue has a special queue requirement that each GPU has to be requested together with 12 CPU cores. More details on the gpuvolta queue are available here.

Alternatively, if your workflow needs no further interactive development, simply run it as a batch job. An example of a submission script that defines a test job is provided below:

job.sh

#!/bin/bash	
# replace the placeholder <prj_compute> with the real project code of which has enough resource to support this job. 
#PBS -P <prj_compute>
#PBS -q gpuvolta
#PBS -l walltime=00:20:00
#PBS -l ncpus=48
#PBS -l ngpus=4
#PBS -l mem=360GB
#PBS -l jobfs=300GB
#PBS -l storage=gdata/dk92+gdata/<prj_compute>
#PBS -l other=hyperthread
#PBS -l wd 
	
module use /g/data/dk92/apps/Modules/modulefiles/
module load rapids/22.02
python3 dask_cudf.test.py 2>&1 | tee ../dask_cudf.test.$(date +\%Y).$(date +\%m).$(date +\%d).log

This example establishes a dask scheduler with four workers for the python script `dask_cudf.test.py`. Please see this RAPIDS Single Cell example to learn how to start the scheduler and connect with the workers.

To submit the job defined in the above submission script `job.sh`, run

qsub job.sh

on the login node.

Work with Other Python Packages

To install packages used in your workflow that are not included in the dk92 RAPIDS module, we recommend to try first the `--no-binary :all:` option before pulling in binaries and libraries built elsewhere.

For example, on the login node, to install graph-walker on top of the packages included in rapids/22.02, try the following:

module use /g/data/dk92/apps/Modules/modulefiles/
module load rapids/22.02
INSTALL_DIR=/g/data/$PROJECT/.local/envs/rapids22.02_topups
mkdir -p $INSTALL_DIR
python3 -m pip install -v --no-binary :all: --upgrade-strategy only-if-needed --prefix $INSTALL_DIR pybind11==2.9.1
export PYTHONPATH=$INSTALL_DIR/lib/python3.9/site-packages:$PYTHONPATH
python3 -m pip install -v --no-binary :all: --upgrade-strategy only-if-needed --prefix $INSTALL_DIR graph-walker==1.0.6

Note that the package pybind11 is required by graph-walker but is not included in its build process. Therefore, we needed to install pybind11 manually before building graph-walker.

Unfortunately, not all python packages have the install script ready through PyPI to facilitate the building from source option. For example, ray doesn't support it:

$ python3 -m pip install -v --no-binary :all: --upgrade-strategy only-if-needed --prefix $INSTALL_DIR ray
Using pip 22.0.3 from /opt/conda/envs/rapids/lib/python3.9/site-packages/pip (python 3.9)
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

In this scenario, we have to drop the `--no-binary :all:` option, and allow the installation to pull in binaries and libraries built elsewhere:

$ python3 -m pip install -v --upgrade-strategy only-if-needed --prefix $INSTALL_DIR ray==1.11.0
$ find $INSTALL_DIR/lib/python3.9/site-packages/ray -type f | grep "\.so" | awk -F"/" '{print $NF}'
_raylet.so
setproctitle.cpython-39-x86_64-linux-gnu.so
_psutil_linux.cpython-39-x86_64-linux-gnu.so
_psutil_posix.cpython-39-x86_64-linux-gnu.so

Calling functions defined in those libraries may or may not work properly on Gadi. Therefore, we recommend undertaking thorough testing to ensure packages installed in this way work as expected.

Examples

The majority of functions provided in the dk92 RAPIDS module are expected to work per their respective documentation. For example, to learn how the function cudf.DataFrame.groupby works, read the following API reference manual: https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.groupby.html. However, there are a select few functions that require Gadi specific considerations and we demonstrate these in the following examples.

E1. Rapids Single Cell Example

In this example we show step-by-step instructions of how to prepare and run a notebook published on GitHub inside a Gadi Interactive job.

E2. Run Rapids on Multiple GPU Nodes

It is possible to run large scale analysis that benefits from massive parallelism on multiple GPU nodes. Below is an example job submission script that requests multiple GPUs on multiple nodes:

job_mnmg.sh

#!/bin/bash	
# replace the placeholder <prj_compute> with the real project code of which has enough resource to support this job. 
#PBS -P <prj_compute>
#PBS -q gpuvolta
#PBS -l walltime=00:20:00
#PBS -l ncpus=96
#PBS -l ngpus=8
#PBS -l mem=720GB
#PBS -l jobfs=400GB
#PBS -l storage=gdata/dk92+gdata/<prj_compute>
#PBS -l wd 
	
module use /g/data/dk92/apps/Modules/modulefiles/
module load rapids/22.02
start.mnmg.dask.scheduler.sh
python3 analysis.runs.on.8.GPUs.py 2>&1 | tee output.log

Before launching the python script `analysis.runs.on.8.GPUs.py` in which the majority of the data operations utilise all the required 8 GPUs, the bash script `start.mnmg.dask.scheduler.sh` initiates the dask scheduler inside the head node, and then establishes all the CUDA worker processes both inside the head node and on all the other GPU nodes. This dask cluster preparation step is only required for jobs that run on multiple nodes using multiple GPUs.

Once the dask scheduler is ready, it writes its configuration to the file scheduler.json inside the working directory ${PBS_O_WORKDIR} and this file is required when launching the client inside the python script `analysis.runs.on.8.GPUs.py`. One possible way to start the client could be

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')

If the python script runs outside ${PBS_O_WORKDIR}, please pass the full path to the `Client` function.

E3. Use JupyterLab and Dask/GPU Dashboard

Please user ARE to work with Jupyterlab and RAPIDS at Gadi.

Page tree

Running RAPIDS on NCI Gadi