There is a wealth of online material related to RAPIDS. One example is this blog [https://developer.nvidia.com/blog/accelerating-single-cell-genomic-analysis-using-rapids/], which showcases how to accelerate single-cell genomic analysis using RAPIDS. Its authors released several notebooks to showcase their work, and these notebooks can be run on Gadi with some modifications. In the following example, we will look at the notebook [https://github.com/clara-parabricks/rapids-single-cell-examples/blob/v2021.12.0/notebooks/1M_brain_gpu_analysis_multigpu.ipynb] and show how to modify our RAPIDS environment to run it on Gadi.
Install Missing Packages
Most of the packages imported in this notebook are available in rapids/22.02. Only two packages are missing: scanpy and anndata. Please follow the instructions provided on this page under `Work with Other Python Packages` to gain an understanding of how to install additional packages.
For this example, we will use pip to install scanpy:
login-node $ python3 -m pip install -v --upgrade-strategy only-if-needed --prefix $INSTALL_DIR scanpy
where INSTALL_DIR defines where scanpy will be installed.
As scanpy pulls in anndata as one of its dependencies, you should see both packages available in $INSTALL_DIR/lib/python3.9/site-packages after the command above completes successfully.
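The `--prefix` option determines where pip places the packages: under `$INSTALL_DIR/lib/pythonX.Y/site-packages`, where `X.Y` is the Python version of the rapids/22.02 module (3.9). A minimal sketch of how that path is derived (the helper name and the example prefix are our own, not part of pip):

```python
import sys
from pathlib import Path

def prefix_site_packages(prefix, version_info=None):
    """Return the site-packages directory that `pip install --prefix`
    uses for a given installation prefix and Python version."""
    major, minor = (version_info or sys.version_info)[:2]
    return Path(prefix) / "lib" / f"python{major}.{minor}" / "site-packages"

# rapids/22.02 ships Python 3.9, so this is the directory
# that must later be added to PYTHONPATH (hypothetical prefix):
print(prefix_site_packages("/g/data/ab12/rapids-extras", (3, 9)))
# /g/data/ab12/rapids-extras/lib/python3.9/site-packages
```

This is also why the PYTHONPATH export in the interactive-job section below hard-codes `python3.9`: the path must match the Python version that built the packages.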
login-node $ ls $INSTALL_DIR/lib/python3.9/site-packages
anndata                  natsort-8.1.0.dist-info  scanpy                  sinfo-0.3.4.dist-info        tables
anndata-0.8.0.dist-info  numexpr                  scanpy-1.8.2.dist-info  stdlib_list                  tables-3.7.0.dist-info
natsort                  numexpr-2.8.1.dist-info  sinfo                   stdlib_list-0.8.0.dist-info  tables.libs
To test scanpy, you can try running the test suite that comes with it:
login-node $ git clone -b 1.8.2 https://github.com/scverse/scanpy.git
login-node $ cd scanpy
login-node $ py.test > ../scanpy.test.log
Given the scanpy tests take less than 10 minutes to complete, you should be able to run these tests on the login node. For more intensive tests, please run using a PBS job.
Some tests in the scanpy test suite fail because their RMS values exceed the expected tolerance. These failures are a good demonstration of the risks involved when running applications that are not built from source on Gadi. As the failing tests exceed the tolerance limit by only a small margin, we will proceed with this installation for the following example.
Download Input Data and Auxiliary Function Code
The example notebook has a short section of code that downloads the input data. On Gadi, downloads must be run on either the login nodes or the copyq nodes, as Gadi compute nodes have no access to external networks. Given the dataset in this example is small, we will download it directly on the login node:
login-node $ mkdir -p $WORKDIR
login-node $ wget https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad -P $WORKDIR
where WORKDIR defines the path to the working directory in which you will run the notebook.
The notebook also calls auxiliary functions defined in another file hosted in the same GitHub repository. This file would also need to be downloaded to your working directory:
login-node $ wget https://raw.githubusercontent.com/clara-parabricks/rapids-single-cell-examples/5ca2a69d852c2a7ad843b57209e8fdce450336f7/notebooks/rapids_scanpy_funcs.py -P $WORKDIR
Note that we use the v2021.12.0 version of the notebook; this file may differ in other branches.
Start an Interactive Job on a GPU Node and Initiate the Dask LocalCUDACluster
Once the required Python packages, input data and auxiliary functions are all available on Gadi, we can run the example notebook on a GPU node. To gain a deeper understanding of the inner workings of this notebook, we recommend running it interactively.
To submit the interactive job:
login-node $ cd $WORKDIR
login-node $ qsub -I -P${PROJECT} -vINSTALL_DIR=$INSTALL_DIR -qgpuvolta -lwalltime=00:30:00,ncpus=24,ngpus=2,mem=180GB,jobfs=200GB,storage=gdata/dk92+gdata/${PROJECT},other=hyperthread,wd
Note that this interactive job assumes your default project $PROJECT has enough service units (SUs) to support a 2-GPU job for half an hour. If this is not the case, please replace $PROJECT with a project code that has sufficient compute resources. For more information on how to look up resource availability in projects, please see the Gadi User Guide.
The interactive job also assumes the directory $INSTALL_DIR is located inside /g/data/${PROJECT} and $WORKDIR inside /scratch/${PROJECT} where ${PROJECT} defines the project code that supports this job. If this is not the case, please revise the string passed to the PBS -lstorage directive accordingly. More information on PBS directives can be found here.
Once the job is ready, prepare the environment and initiate the Dask cluster inside python3. Please note that when editing your PYTHONPATH, the variable INSTALL_DIR passed through the PBS job submission line is used. If INSTALL_DIR is not accessible from the job, importing the scanpy and anndata packages will fail.
gpu-node $ module use /g/data/dk92/apps/Modules/modulefiles/
gpu-node $ module load rapids/22.02
gpu-node $ export PYTHONPATH=$INSTALL_DIR/lib/python3.9/site-packages:$PYTHONPATH
gpu-node $ python3
. . .
python3 >>> from dask_cuda import initialize, LocalCUDACluster
python3 >>> from dask.distributed import Client, default_client
...
python3 >>> cluster = LocalCUDACluster()
python3 >>> client = Client(cluster)
python3 >>> client
<Client: 'tcp://127.0.0.1:40259' processes=2 threads=2, memory=180.00 GiB>
When initiating the LocalCUDACluster, no arguments are required as long as all workers run on the same compute node. Running LocalCUDACluster() starts a local scheduler and connects one worker per GPU available inside the job. Since the Gadi gpuvolta queue has 4 GPUs per node, this method is only valid for jobs that require no more than 4 GPUs. To learn how to run tasks on more than 4 GPUs across multiple nodes, follow the instructions in Example 2 on this page.
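The one-worker-per-GPU behaviour can be illustrated in plain Python: dask-cuda sizes the cluster from the GPUs visible to the job, which the scheduler exposes via CUDA_VISIBLE_DEVICES. The sketch below mimics that counting logic; it is our own simplified illustration, not the actual dask-cuda implementation:

```python
import os

def visible_gpu_count(env=None):
    """Count the GPUs visible to the job, mimicking how a
    LocalCUDACluster sizes itself: one worker per device listed
    in CUDA_VISIBLE_DEVICES."""
    env = os.environ if env is None else env
    devices = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in devices.split(",") if d.strip() != ""])

# A 2-GPU Gadi job typically sees two devices, hence
# "processes=2" in the Client summary above:
print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"}))  # 2
```

This is why the client printed `processes=2` for our 2-GPU job: the cluster started exactly one worker process per visible GPU.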
Load Input Data using all the Dask CUDA Workers
In the example notebook, the input is ingested by calling the function read_with_filter defined in the file rapids_scanpy_funcs.py:
python3 >>> input_file = "1M_brain_cells_10X.sparse.h5ad"
python3 >>> min_genes_per_cell = 200
python3 >>> max_genes_per_cell = 6000
python3 >>> dask_sparse_arr, genes, query = rapids_scanpy_funcs.read_with_filter(client,
...             input_file,
...             min_genes_per_cell=min_genes_per_cell,
...             max_genes_per_cell=max_genes_per_cell,
...             partial_post_processor=partial_post_processor)
This function first opens the h5 file to find the total number of cells, then defines the batch of data read by each worker accordingly. The parallel read is initiated by calling dask.array.from_delayed, which in turn launches all the individual reads. Each read fetches at most the default batch size of 50000 cells. Given the input file contains 1306127 cell records, the first 26 tasks each read 50000 cells while the 27th reads the remaining 6127. By tuning the batch size, you can minimise the time spent on data ingestion. Below is a benchmark showing the batch size and the associated normalised read walltime:
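The batching arithmetic described above can be reproduced in a few lines of plain Python. This is a sketch of the partitioning logic only, not the actual rapids_scanpy_funcs code:

```python
import math

def read_batches(total_cells, batch_size=50000):
    """Split total_cells into read tasks of at most batch_size cells,
    mirroring how read_with_filter partitions the input file."""
    n_tasks = math.ceil(total_cells / batch_size)
    return [min(batch_size, total_cells - i * batch_size) for i in range(n_tasks)]

batches = read_batches(1306127)   # the 1M brain cell dataset
print(len(batches))               # 27 read tasks
print(batches[-1])                # the last task reads the remaining 6127 cells
```

Changing `batch_size` changes both the number of tasks and the size of the final, partially filled batch, which is what the benchmark below explores.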
| batch size | 10000 | 25000 | 50000 | 73000 | 100000 | 165000 | 200000 |
|---|---|---|---|---|---|---|---|
| normalised read walltime | 1.084 | 0.990 | 1 | 1.006 | 1.050 | 1.038 | 1.114 |
If data ingestion takes a considerable amount of time in your production jobs, you may want to work through this optimisation carefully. In general, IO operations benefit from larger batch sizes on Lustre filesystems, as many small reads and writes can result in very poor performance in terms of both efficiency and robustness. However, when larger batch sizes lead to workload imbalance, the overall read walltime increases because the entire IO has to wait for the last task to finish. For example, in the benchmark above, a batch size of 200K results in 7 read tasks scheduled on 2 workers, so one worker most likely has to wait while the other performs its fourth read task.
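The imbalance argument can be made concrete: with a fixed number of workers, the read walltime is governed by the worker that performs the most read rounds. The sketch below is our own illustration of that estimate, using the task counts implied by the benchmark above:

```python
import math

def max_tasks_per_worker(total_cells, batch_size, n_workers):
    """Upper bound on read rounds for the busiest worker: tasks are
    spread across workers, so one worker may end up running
    ceil(n_tasks / n_workers) reads back to back."""
    n_tasks = math.ceil(total_cells / batch_size)
    return math.ceil(n_tasks / n_workers)

# 50K batches: 27 tasks over 2 workers -> 14 rounds of 50K cells each
print(max_tasks_per_worker(1306127, 50000, 2))   # 14
# 200K batches: 7 tasks over 2 workers -> only 4 rounds,
# but each round moves 4x as much data, so the critical path grows
print(max_tasks_per_worker(1306127, 200000, 2))  # 4
```

At 200K the busiest worker's critical path is roughly 4 × 200K = 800K cells, versus 14 × 50K = 700K at the default batch size, which is consistent with the 200K case being the slowest row in the benchmark.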
Even though read_with_filter does more than pure IO, the trend is clear: time spent in IO dominates the overall overhead.
Run Data Analysis on Multiple GPUs
There are several computationally intensive tasks in this notebook; we take PCA as an example to show how to run it on multiple GPUs on Gadi.
In the notebook, there are two PCA decomposition functions shipped with the `cuML` package:
from cuml.decomposition import PCA
from cuml.dask.decomposition import PCA
Only the cuml.dask.decomposition package runs on multiple GPUs. You can find similar functions in this section of the cuML API reference.
Save Plots to Files
The example notebook can visualise the clustering results with plots. Since the current interactive job has no display attached, the following modification is needed to save the plots to files. Once the results are saved to a file, you can download it to your local PC and inspect the results:
python3 >>> sc.pl.tsne(adata, color=["kmeans"],show=False,save="_kmeans.pdf")
The code above tells the scanpy plotting tool not to show the figure but to save it to a PDF file instead. Once writing finishes, the image can be found at `figures/tsne_kmeans.pdf` inside your working directory.