There are many online materials to follow. For example, in this blog [https://developer.nvidia.com/blog/accelerating-single-cell-genomic-analysis-using-rapids/], the authors release several notebooks to showcase their work. The code is expected to run on Gadi with some modifications. Here we take the notebook [https://github.com/clara-parabricks/rapids-single-cell-examples/blob/v2021.12.0/notebooks/1M_brain_gpu_analysis_multigpu.ipynb] as an example to show how to run it on Gadi and what modifications are needed.
Install Missing Packages
Most of the packages imported in this notebook are available in rapids/2022.02. Only two are missing: scanpy and anndata. Follow the instructions in the section `Work with Other Python Packages` on this page to learn how to install top-up packages. The same page also shows how to prepare the environment to use RAPIDS.
In this example, the pip install command can be
login-node $ python3 -m pip install -v --upgrade-strategy only-if-needed --prefix $INSTALL_DIR scanpy
where INSTALL_DIR is the variable that defines where scanpy is installed.
As scanpy installs anndata automatically as one of its dependencies, you can see both packages in $INSTALL_DIR/lib/python3.9/site-packages after the above command finishes successfully.
login-node $ ls $INSTALL_DIR/lib/python3.9/site-packages
anndata                  natsort-8.1.0.dist-info  scanpy                  sinfo-0.3.4.dist-info        tables
anndata-0.8.0.dist-info  numexpr                  scanpy-1.8.2.dist-info  stdlib_list                  tables-3.7.0.dist-info
natsort                  numexpr-2.8.1.dist-info  sinfo                   stdlib_list-0.8.0.dist-info  tables.libs
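To confirm that Python will pick up the top-up installation, you can prepend the site-packages path to sys.path, which is exactly what the PYTHONPATH export used later in the GPU job achieves. A minimal sketch (the fallback path here is purely hypothetical):

```python
import os
import sys

# INSTALL_DIR is the same variable used in the pip install command above;
# the fallback path is illustrative only.
install_dir = os.environ.get("INSTALL_DIR", "/g/data/myproject/apps/scanpy")
site_pkgs = os.path.join(install_dir, "lib", "python3.9", "site-packages")

# Prepending mirrors what `export PYTHONPATH=...:$PYTHONPATH` does for a new
# interpreter: entries earlier in sys.path take precedence.
if site_pkgs not in sys.path:
    sys.path.insert(0, site_pkgs)

print(sys.path[0])
```

If the directory is missing from sys.path (or is shadowed by another copy earlier in the path), `import scanpy` will either fail or silently pick up the wrong version.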
To check whether and how well it works, run the test suite that comes with it, as below.
git clone -b 1.8.2 https://github.com/scverse/scanpy.git
cd scanpy
py.test > ../scanpy.test.log
Given the scanpy tests take less than 10 minutes to finish, it is OK to run them on the login node. For more intensive tests, please run in a PBS job.
Some tests in the scanpy test suite fail because their RMS values are greater than the expected tolerance. These failures are a good demonstration of the risk of running applications that are not built from source on Gadi. As they do not exceed the tolerance by a large margin, we still use this installation in the following example.
Download Input Data and Auxiliary Function Code
The notebook has a short section of code that downloads the input data. On Gadi, it needs to run on either the login nodes or the copyq nodes, because Gadi compute nodes have no access to external networks. Given the dataset is not big, we run the download on the login node directly, as follows.
login-node $ mkdir -p $WORKDIR
login-node $ wget https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad -P $WORKDIR
where WORKDIR is the variable that defines the path to the working directory in which you run the notebook.
The notebook calls auxiliary functions defined in another file hosted in the same GitHub repository. Download the file to the working directory as below.
login-node $ wget https://raw.githubusercontent.com/clara-parabricks/rapids-single-cell-examples/5ca2a69d852c2a7ad843b57209e8fdce450336f7/notebooks/rapids_scanpy_funcs.py -P $WORKDIR
Please note, we use the v2021.12.0 version of the notebook. The file can be different in other branches.
Start an Interactive Job on a GPU Node and Initiate the Dask LocalCUDACluster
Once the required Python packages, input data, and auxiliary functions are all available on Gadi, it is time to run the code on a GPU node. To get the most out of the notebook, it is recommended to run it interactively.
To submit the interactive job, try
login-node $ cd $WORKDIR
login-node $ qsub -I -P${PROJECT} -vINSTALL_DIR=$INSTALL_DIR -qgpuvolta -lwalltime=00:30:00,ncpus=24,ngpus=2,mem=180GB,jobfs=200GB,storage=gdata/dk92+gdata/${PROJECT},other=hyperthread,wd
Please note, this assumes your default project $PROJECT has enough SUs to support a 2-GPU job for half an hour. If not, replace it with a project code that gives you access to compute resources. To learn how to look up resource availability in projects, read the Gadi User Guide.
It also assumes the directory $INSTALL_DIR is located inside /g/data/${PROJECT} and $WORKDIR inside /scratch/$PROJECT, where PROJECT is the variable that defines the project supporting this job. If not, please revise the string passed to the PBS -lstorage directive accordingly; see more details here.
Once the job starts, prepare the environment and initiate the Dask cluster inside python3. Please note, when editing PYTHONPATH, the variable INSTALL_DIR passed through the PBS job submission line is used. If it is not accessible from the job, importing the scanpy and anndata packages will fail.
gpu-node $ module use /g/data/dk92/apps/Modules/modulefiles/
gpu-node $ module load rapids/22.02
gpu-node $ export PYTHONPATH=$INSTALL_DIR/lib/python3.9/site-packages:$PYTHONPATH
gpu-node $ python3
...
python3 >>> from dask_cuda import initialize, LocalCUDACluster
python3 >>> from dask.distributed import Client, default_client
...
python3 >>> cluster = LocalCUDACluster()
python3 >>> client = Client(cluster)
python3 >>> client
<Client: 'tcp://127.0.0.1:40259' processes=2 threads=2, memory=180.00 GiB>
The first necessary modification is shown above. When initiating the LocalCUDACluster, no argument is required as long as all workers run on the same compute node. Running LocalCUDACluster() starts a local scheduler ready to connect to as many workers as there are GPUs available inside the job. Since the Gadi gpuvolta queue has no more than 4 GPUs in a single node, this method is valid for all jobs requesting at most 4 GPUs. To learn how to run tasks on more than 4 GPUs across multiple nodes, follow the instructions in Example 2 on this page.
Load Input Data using all the Dask CUDA Workers
In the notebook, the input is ingested by calling the function read_with_filter, defined in the file rapids_scanpy_funcs.py, as follows.
python3 >>> input_file = "1M_brain_cells_10X.sparse.h5ad"
python3 >>> min_genes_per_cell = 200
python3 >>> max_genes_per_cell = 6000
python3 >>> dask_sparse_arr, genes, query = rapids_scanpy_funcs.read_with_filter(client,
...                                             input_file,
...                                             min_genes_per_cell=min_genes_per_cell,
...                                             max_genes_per_cell=max_genes_per_cell,
...                                             partial_post_processor=partial_post_processor)
Inside the function, it first opens the h5 file to find out the total number of cells, then defines the batch of data to be read by each worker accordingly. The parallel read is initiated by calling dask.array.from_delayed, which in turn launches all the individual reads. Each read fetches a piece of data covering no more than the default batch size of 50000 cells. Given the input file contains 1306127 cell records, each of the first 26 tasks reads 50000 cells while the 27th read fetches the remaining 6127. By tuning the batch size, the time spent in data ingestion can be minimised. A search result is shown below.
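The batching arithmetic described above can be sketched in plain Python. This is an illustration of how the read tasks are partitioned, not the actual read_with_filter implementation:

```python
import math

def batch_ranges(total_cells, batch_size=50000):
    """Split total_cells into contiguous [start, stop) row ranges,
    each covering at most batch_size cells."""
    n_batches = math.ceil(total_cells / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, total_cells))
            for i in range(n_batches)]

ranges = batch_ranges(1306127)
print(len(ranges))   # 27 read tasks in total
print(ranges[-1])    # (1300000, 1306127): the last task reads only 6127 cells
```

Each range would become one delayed read handed to dask.array.from_delayed, so the number of ranges is the number of tasks the workers share.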
| batch size | 10000 | 25000 | 50000 | 73000 | 100000 | 165000 | 200000 |
|---|---|---|---|---|---|---|---|
| normalised read walltime | 1.084 | 0.990 | 1 | 1.006 | 1.050 | 1.038 | 1.114 |
If data ingestion takes a considerable part of the time in your production jobs, you might want to go through this optimisation carefully. In general, IO operations benefit from a bigger batch size on Lustre filesystems, as too many small reads and writes can result in very poor performance in terms of both efficiency and robustness. However, when a bigger batch size leads to workload imbalance, the overall read walltime increases as well because the entire IO has to wait for the last task to finish. For example, at a batch size of 200K there are 7 read tasks scheduled on 2 workers in this example, so it is very likely that one worker has to wait while the other performs the fourth read task.
Even though read_with_filter does more than IO operations, the trend is clear, as time spent in IO dominates the overall overhead.
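The imbalance argument can be checked with a quick calculation. This sketch assumes read tasks are spread evenly across the Dask CUDA workers, one worker per GPU in the 2-GPU job:

```python
import math

total_cells = 1306127
n_workers = 2  # one Dask CUDA worker per GPU

for batch_size in (50000, 200000):
    n_tasks = math.ceil(total_cells / batch_size)
    # The busiest worker runs this many read tasks back to back.
    rounds = math.ceil(n_tasks / n_workers)
    print(batch_size, n_tasks, rounds)
```

At a batch size of 200000, one worker gets 4 tasks and the other 3, so roughly a quarter of the read walltime passes with one GPU idle; at 50000 the 14-versus-13 split costs far less in relative terms, which matches the measured walltimes above.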
Run Data Analysis on Multiple GPUs
There are several computationally intensive tasks in this notebook, and we take PCA as an example to show how to run it on multiple GPUs on Gadi.
The notebook shows there are two PCA decomposition functions shipped with the `cuML` package:
from cuml.decomposition import PCA
from cuml.dask.decomposition import PCA
while only the second one, from cuml.dask.decomposition, runs on multiple GPUs. You can find similar functions in this section of the cuML API references.
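To see why the dask variant can scale across GPUs, here is a CPU-only NumPy sketch of the map-reduce structure behind distributed PCA: each worker computes partial statistics over its own chunk, and only the small reduced results are combined. This illustrates the idea and is not the actual cuml.dask.decomposition.PCA implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
chunks = np.array_split(X, 4)  # pretend each chunk lives on a different GPU

# map step: per-chunk sufficient statistics (cheap to ship between workers)
n = sum(c.shape[0] for c in chunks)
mean = sum(c.sum(axis=0) for c in chunks) / n
scatter = sum((c - mean).T @ (c - mean) for c in chunks)
cov = scatter / (n - 1)

# reduce step: eigendecomposition of the combined 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, ::-1][:, :2]  # top-2 principal axes

# the chunked covariance matches the single-process one
assert np.allclose(cov, np.cov(X, rowvar=False))
```

Because only the per-chunk statistics (a vector and a small matrix) cross worker boundaries, the 1.3M-cell matrix never needs to fit on a single GPU.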
Save Plots to Files
In the notebook, the clustering result is visualised in plots. Given the current interactive job has no display set, the following modification to save the plots to files is necessary. Once the result is saved to a file, download it to your local PC and open it to inspect the result.
python3 >>> sc.pl.tsne(adata, color=["kmeans"],show=False,save="_kmeans.pdf")
This tells the scanpy plotting tool not to show the figure but to save it to a PDF file instead. Once writing finishes, the image can be found in the file `figures/tsne_kmeans.pdf` inside the working directory.
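The same no-display principle applies to any matplotlib-based plotting in a job without X forwarding: select a non-interactive backend and write the figure to a file. A generic sketch (the data and filename are illustrative only):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no display required on the compute node
import matplotlib.pyplot as plt
import os
import tempfile

fig, ax = plt.subplots()
ax.scatter([0, 1, 2], [0, 1, 4])
ax.set_title("clusters (illustrative)")

outfile = os.path.join(tempfile.mkdtemp(), "clusters.pdf")
fig.savefig(outfile)  # write instead of show
print(os.path.exists(outfile))
```

scanpy's `show=False, save=...` arguments wrap exactly this pattern, prefixing the filename with the plot type and placing it under `figures/`.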