...
This function first opens the h5 file to find out the total number of cells, then define defines the batch of data read by each worker accordingly. The parallel read is initiated by calling dask.array.from_delayed which in turn launches all the individual reads. Every read fetches a piece of data no more than the default batch size of 50000. Given the input file contains 1306127 cell records, all of the first 26 tasks read info of from 50000 cells while the 27th read fetches the remaining 6127. By modifying the batch size, you can minimise the time spent for data ingestion. Below is a benchmark showing the batch size and associated normalised read walltime:
...
If data ingestion takes a considerable amount of time in your production jobs, you might want to go through this optimisation carefully. In general, IO operations can benefit from bigger a larger batch size on Lustre filesystems as too many small reads and writes can result in very poor performance in terms of both efficiency and robustness. However, when bigger batch size leads in the case of larger batch sizes leading to workload imbalance, the overall read walltime increases as the entire IO has to wait for the last task to finish. For example, from our benchmark above, a batch size of 200K results in 7 read tasks scheduled on 2 workers and it's very likely that one worker has to wait for the other while it performs the fourth read task.
...
There are several computational intensive tasks in this notebook and we take PCA as an example to show how to run it on multiple GPUs on Gadi.
In the notebook, it shows there are two PCA decomposition functions shipped with the `cuML` package:
...
The example notebook can visualise the cluster results in with plots. Given the current interactive job has no display set, the following modification to save the plots to files is necessary. Once the results are saved to a file, you can download this file to your local PC and inspect the result.results:
Code Block |
---|
python3 >>> sc.pl.tsne(adata, color=["kmeans"],show=False,save="_kmeans.pdf") |
...