There are many online materials to follow. For example, in this blog [https://developer.nvidia.com/blog/accelerating-single-cell-genomic-analysis-using-rapids/], the authors release several notebooks to showcase their work. The code is expected to run on Gadi with some modifications. Here we take the notebook [https://github.com/clara-parabricks/rapids-single-cell-examples/blob/v2021.12.0/notebooks/1M_brain_gpu_analysis_multigpu.ipynb] as an example to show how to run it on Gadi and what modifications are needed.
Install Missing Packages
Most of the packages imported in this notebook are available in rapids/2022.02. Only two are missing: scanpy and anndata. Follow the instructions in the section `Work with Other Python Packages` on this page to learn how to install top-up packages. The same page also shows how to prepare the environment to use RAPIDS.
In this example, the pip install command can be
login-node $ python3 -m pip install -v --upgrade-strategy only-if-needed --prefix $INSTALL_DIR scanpy
where INSTALL_DIR is the variable that defines where scanpy is installed.
As scanpy installs anndata automatically as one of its dependencies, you can see both packages in $INSTALL_DIR/lib/python3.9/site-packages after the above command finishes successfully.
login-node $ ls $INSTALL_DIR/lib/python3.9/site-packages
anndata                  natsort-8.1.0.dist-info  scanpy                  sinfo-0.3.4.dist-info        tables
anndata-0.8.0.dist-info  numexpr                  scanpy-1.8.2.dist-info  stdlib_list                  tables-3.7.0.dist-info
natsort                  numexpr-2.8.1.dist-info  sinfo                   stdlib_list-0.8.0.dist-info  tables.libs
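To confirm that Python will pick up the top-up installation, you can prepend the site-packages path to sys.path, which is exactly what the PYTHONPATH export used later in the GPU job achieves. A minimal sketch (the fallback path here is purely hypothetical):

```python
import os
import sys

# INSTALL_DIR is the same variable used in the pip install command above;
# the fallback path is illustrative only.
install_dir = os.environ.get("INSTALL_DIR", "/g/data/myproject/apps/scanpy")
site_pkgs = os.path.join(install_dir, "lib", "python3.9", "site-packages")

# Prepending mirrors what `export PYTHONPATH=...:$PYTHONPATH` does for a new
# interpreter: entries earlier in sys.path take precedence.
if site_pkgs not in sys.path:
    sys.path.insert(0, site_pkgs)

print(sys.path[0])
```

If the directory is missing from sys.path (or is shadowed by another copy earlier in the path), `import scanpy` will either fail or silently pick up the wrong version.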
To check whether and how well it works, run the test suite that comes with it, as below.
git clone -b 1.8.2 https://github.com/scverse/scanpy.git
cd scanpy
py.test > ../scanpy.test.log
Given the scanpy tests take less than 10 minutes to finish, it is OK to run them on the login node. For more intensive tests, please run in a PBS job.
Some tests in the scanpy test suite fail because their RMS values are greater than the expected tolerance. These failures are a good demonstration of the risk of running applications that are not built from source on Gadi. As they do not exceed the tolerance by a large margin, we still use this installation in the following example.
Download Input Data and Auxiliary Function Code
The notebook has a short section of code that downloads the input data. On Gadi, it needs to run on either the login nodes or the copyq nodes, because Gadi compute nodes have no access to external networks. Given the dataset is not big, we run the download on the login node directly, as follows.
login-node $ mkdir -p $WORKDIR
login-node $ wget https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad -P $WORKDIR
where WORKDIR is the variable that defines the path to the working directory in which you run the notebook.
The notebook calls auxiliary functions defined in another file hosted in the same GitHub repository. Download the file to the working directory as below.
login-node $ wget https://raw.githubusercontent.com/clara-parabricks/rapids-single-cell-examples/5ca2a69d852c2a7ad843b57209e8fdce450336f7/notebooks/rapids_scanpy_funcs.py -P $WORKDIR
Please note, we use the v2021.12.0 version of the notebook. The file can be different in other branches.
Start an Interactive Job on a GPU Node and Initiate the Dask LocalCUDACluster
Once the required Python packages, input data, and auxiliary functions are all available on Gadi, it is time to run the code on a GPU node. To get the most out of the notebook, it is recommended to run it interactively.
To submit the interactive job, try
login-node $ cd $WORKDIR
login-node $ qsub -I -P${PROJECT} -vINSTALL_DIR=$INSTALL_DIR -qgpuvolta -lwalltime=00:30:00,ncpus=24,ngpus=2,mem=180GB,jobfs=200GB,storage=gdata/dk92+gdata/${PROJECT},other=hyperthread,wd
Please note, this assumes your default project $PROJECT has enough SUs to support a 2-GPU job for half an hour. If not, replace it with a project code that gives you access to compute resources. To learn how to look up resource availability in projects, read the Gadi User Guide.
It also assumes the directory $INSTALL_DIR is located inside /g/data/${PROJECT} and $WORKDIR inside /scratch/$PROJECT, where PROJECT is the variable that defines the project supporting this job. If not, please revise the string passed to the PBS -lstorage directive accordingly; see more details here.
Once the job starts, prepare the environment and initiate the Dask cluster inside python3. Please note, when editing PYTHONPATH, the variable INSTALL_DIR passed through the PBS job submission line is used. If it is not accessible from the job, importing the scanpy and anndata packages will fail.
gpu-node $ module use /g/data/dk92/apps/Modules/modulefiles/
gpu-node $ module load rapids/22.02
gpu-node $ export PYTHONPATH=$INSTALL_DIR/lib/python3.9/site-packages:$PYTHONPATH
gpu-node $ python3
...
python3 >>> from dask_cuda import initialize, LocalCUDACluster
python3 >>> from dask.distributed import Client, default_client
...
python3 >>> cluster = LocalCUDACluster()
python3 >>> client = Client(cluster)
python3 >>> client
<Client: 'tcp://127.0.0.1:40259' processes=2 threads=2, memory=180.00 GiB>
The first necessary modification is shown above. When initiating the LocalCUDACluster, no argument is required as long as all workers run on the same compute node. Running LocalCUDACluster() starts a local scheduler ready to connect to as many workers as there are GPUs available inside the job. Since the Gadi gpuvolta queue has no more than 4 GPUs in a single node, this method is valid for all jobs requesting at most 4 GPUs. To learn how to run tasks on more than 4 GPUs across multiple nodes, follow the instructions in Example 2 on this page.
Load Input Data using all the Dask CUDA Workers
In the notebook, the input is ingested by calling the function read_with_filter, defined in the file rapids_scanpy_funcs.py, as follows.
python3 >>> input_file = "1M_brain_cells_10X.sparse.h5ad"
python3 >>> min_genes_per_cell = 200
python3 >>> max_genes_per_cell = 6000
python3 >>> dask_sparse_arr, genes, query = rapids_scanpy_funcs.read_with_filter(client,
...                                             input_file,
...                                             min_genes_per_cell=min_genes_per_cell,
...                                             max_genes_per_cell=max_genes_per_cell,
...                                             partial_post_processor=partial_post_processor)
Inside the function, it first opens the h5 file to find out the total number of cells, then defines the batch of data to be read by each worker accordingly. The parallel read is initiated by calling dask.array.from_delayed, which in turn launches all the individual reads. Each read fetches a piece of data covering no more than the default batch size of 50000 cells. Given the input file contains 1306127 cell records, each of the first 26 tasks reads 50000 cells while the 27th read fetches the remaining 6127. By tuning the batch size, the time spent in data ingestion can be minimised. A search result is shown below.
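The batching arithmetic described above can be sketched in plain Python. This is an illustration of how the read tasks are partitioned, not the actual read_with_filter implementation:

```python
import math

def batch_ranges(total_cells, batch_size=50000):
    """Split total_cells into contiguous [start, stop) row ranges,
    each covering at most batch_size cells."""
    n_batches = math.ceil(total_cells / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, total_cells))
            for i in range(n_batches)]

ranges = batch_ranges(1306127)
print(len(ranges))   # 27 read tasks in total
print(ranges[-1])    # (1300000, 1306127): the last task reads only 6127 cells
```

Each range would become one delayed read handed to dask.array.from_delayed, so the number of ranges is the number of tasks the workers share.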
| batch size | 10000 | 25000 | 50000 | 73000 | 100000 | 165000 | 200000 |
|---|---|---|---|---|---|---|---|
| normalised read walltime | 1.084 | 0.990 | 1 | 1.006 | 1.050 | 1.038 | 1.114 |
If data ingestion takes a considerable part of the time in your production jobs, you might want to go through this optimisation carefully. In general, IO operations benefit from a bigger batch size on Lustre filesystems, as too many small reads and writes can result in very poor performance in terms of both efficiency and robustness. However, when a bigger batch size leads to workload imbalance, the overall read walltime increases as well because the entire IO has to wait for the last task to finish. For example, at a batch size of 200K there are 7 read tasks scheduled on 2 workers in this example, so it is very likely that one worker has to wait while the other performs the fourth read task.
Even though read_with_filter does more than IO operations, the trend is clear, as time spent in IO dominates the overall overhead.
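The imbalance argument can be checked with a quick calculation. This sketch assumes read tasks are spread evenly across the Dask CUDA workers, one worker per GPU in the 2-GPU job:

```python
import math

total_cells = 1306127
n_workers = 2  # one Dask CUDA worker per GPU

for batch_size in (50000, 200000):
    n_tasks = math.ceil(total_cells / batch_size)
    # The busiest worker runs this many read tasks back to back.
    rounds = math.ceil(n_tasks / n_workers)
    print(batch_size, n_tasks, rounds)
```

At a batch size of 200000, one worker gets 4 tasks and the other 3, so roughly a quarter of the read walltime passes with one GPU idle; at 50000 the 14-versus-13 split costs far less in relative terms, which matches the measured walltimes above.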
Run Data Analysis on Multiple GPUs
There are several computationally intensive tasks in this notebook, and we take PCA as an example to show how to run it on multiple GPUs on Gadi.
The notebook shows there are two PCA decomposition functions shipped with the `cuML` package:
from cuml.decomposition import PCA
from cuml.dask.decomposition import PCA
while only the second one, from cuml.dask.decomposition, runs on multiple GPUs. You can find similar functions in this section of the cuML API references.
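To see why the dask variant can scale across GPUs, here is a CPU-only NumPy sketch of the map-reduce structure behind distributed PCA: each worker computes partial statistics over its own chunk, and only the small reduced results are combined. This illustrates the idea and is not the actual cuml.dask.decomposition.PCA implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
chunks = np.array_split(X, 4)  # pretend each chunk lives on a different GPU

# map step: per-chunk sufficient statistics (cheap to ship between workers)
n = sum(c.shape[0] for c in chunks)
mean = sum(c.sum(axis=0) for c in chunks) / n
scatter = sum((c - mean).T @ (c - mean) for c in chunks)
cov = scatter / (n - 1)

# reduce step: eigendecomposition of the combined 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, ::-1][:, :2]  # top-2 principal axes

# the chunked covariance matches the single-process one
assert np.allclose(cov, np.cov(X, rowvar=False))
```

Because only the per-chunk statistics (a vector and a small matrix) cross worker boundaries, the 1.3M-cell matrix never needs to fit on a single GPU.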
Save Plots to Files
In the notebook, the clustering result is visualised in plots. Given the current interactive job has no display set, the following modification to save the plots to files is necessary. Once the result is saved to a file, download it to your local PC and open it to inspect the result.
python3 >>> sc.pl.tsne(adata, color=["kmeans"],show=False,save="_kmeans.pdf")
This tells the scanpy plotting tool not to show the figure but to save it to a PDF file instead. Once writing finishes, the image can be found in the file `figures/tsne_kmeans.pdf` inside the working directory.
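The same no-display principle applies to any matplotlib-based plotting in a job without X forwarding: select a non-interactive backend and write the figure to a file. A generic sketch (the data and filename are illustrative only):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no display required on the compute node
import matplotlib.pyplot as plt
import os
import tempfile

fig, ax = plt.subplots()
ax.scatter([0, 1, 2], [0, 1, 4])
ax.set_title("clusters (illustrative)")

outfile = os.path.join(tempfile.mkdtemp(), "clusters.pdf")
fig.savefig(outfile)  # write instead of show
print(os.path.exists(outfile))
```

scanpy's `show=False, save=...` arguments wrap exactly this pattern, prefixing the filename with the plot type and placing it under `figures/`.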