High-performance computing (HPC) tools and accelerators scale applications across CPUs, GPUs, and clusters for faster, more efficient computation.

Choosing the right tool depends on project needs: Dask scales Python workflows across cores, GPUs, or clusters with minimal code changes; CuPy and CUDA focus on GPU acceleration; OpenACC and OpenMP enable directive-based parallelism on CPUs and GPUs; and MPI provides large-scale distributed-memory communication.

A good rule of thumb is to start with high-level tools (Dask, CuPy, OpenACC) if productivity and ease of adoption are priorities, and move toward lower-level or distributed tools (CUDA, OpenMP, MPI) when performance or scalability requirements demand it.

Common HPC and Accelerator Tools

| Tool | Category | Language/API | Parallelism Type | Target Hardware | Typical Use Case |
|------|----------|--------------|------------------|-----------------|------------------|
| Dask | Python library | Python | Data-parallel CPU/GPU | CPU, GPU, clusters | Scaling NumPy/Pandas/Scikit-learn, workflow parallelism, out-of-core datasets |
| CuPy | Python library | Python (NumPy-compatible API) | Data-parallel GPU | NVIDIA GPUs | Data science, ML prototyping, array and matrix computations |
| CUDA | GPU programming platform | C/C++, Fortran (also Python via PyCUDA, Numba) | Data-parallel GPU | NVIDIA GPUs | Writing custom GPU kernels, fine-tuned performance, deep learning frameworks |
| OpenACC | Compiler directives | C/C++, Fortran pragmas | Data-parallel GPU/CPU | GPUs & other accelerators | Annotating loops to offload work to accelerators |
| OpenMP | Compiler directives | C/C++, Fortran pragmas | Shared-memory CPU (GPU offload via target directives) | Multi-core CPUs, GPUs | Parallelising loops and regions on a single node, hybrid MPI+OpenMP, multi-threaded codes |
| MPI | Library & standard | C/C++, Fortran, Python (via mpi4py) | Distributed-memory | Clusters & networks | Large-scale distributed computing, simulations |
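The shared-memory model in the OpenMP row (all workers operate on the same in-memory data, with no copies or messages) can be sketched in plain Python with only the standard library. This is a stand-in for illustration, not OpenMP itself; note that CPython's GIL limits true CPU parallelism for pure-Python code, which is one reason OpenMP lives at the compiled-language level.

```python
# Shared-memory parallelism: every worker thread reads the SAME list
# in place -- contrast with distributed memory, where each process
# holds its own copy and data moves via explicit messages.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, nthreads=4):
    """Split index ranges across threads; all threads share `data`."""
    step = (len(data) + nthreads - 1) // nthreads
    ranges = [(i, min(i + step, len(data))) for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # Each worker sums its own slice of the shared list.
        partials = pool.map(lambda r: sum(data[r[0]:r[1]]), ranges)
    return sum(partials)
```

In OpenMP the same idea is a single `#pragma omp parallel for` over the loop, with the compiler handling the thread pool.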


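The distributed-memory model in the MPI row is the opposite: workers share nothing and communicate by explicit messages. The sketch below uses the standard-library `multiprocessing` module as a single-machine stand-in (with mpi4py, the queue operations would become `comm.send`/`comm.recv` across cluster nodes); the function names are illustrative.

```python
# Distributed-memory parallelism: rank 0 scatters slices of the data
# to workers, each worker computes on its private copy, and partial
# results are gathered back via messages.
from multiprocessing import Process, Queue

def worker(rank, inbox, outbox):
    chunk = inbox.get()             # "receive" a chunk from rank 0
    outbox.put((rank, sum(chunk)))  # "send" the partial result back

def scatter_sum(data, nworkers=4):
    inboxes = [Queue() for _ in range(nworkers)]
    outbox = Queue()
    procs = [Process(target=worker, args=(r, inboxes[r], outbox))
             for r in range(nworkers)]
    for p in procs:
        p.start()
    # Rank 0 scatters slices, then gathers and reduces partial sums.
    step = (len(data) + nworkers - 1) // nworkers
    for r in range(nworkers):
        inboxes[r].put(data[r * step:(r + 1) * step])
    total = sum(outbox.get()[1] for _ in range(nworkers))
    for p in procs:
        p.join()
    return total
```

Unlike the shared-memory case, each worker here only ever sees the slice it was sent, which is what lets the same pattern span many machines.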
What is an Out-of-Core Dataset?

An out-of-core dataset is a dataset too large to fit entirely in your computer’s main memory (RAM). Instead of loading everything at once, tools like Dask process data in chunks — streaming pieces into memory, computing results, and discarding them before loading the next chunk.

  • Example: A 200 GB CSV file on a laptop with only 16 GB of RAM. Dask can process it chunk by chunk, whereas Pandas would fail trying to load the whole file into memory.

  • Benefit: Enables big-data analysis on machines with limited memory.
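The chunked-streaming pattern that Dask automates can be sketched with only the standard library; the column name and chunk size below are illustrative.

```python
# Out-of-core pattern: stream fixed-size chunks through memory,
# reduce each chunk, and discard it before reading the next one.
# Dask automates this (and parallelises it across cores); this
# plain-Python sketch shows the core idea.
import csv

def chunked_column_sum(lines, column, chunksize=10_000):
    """Sum a numeric CSV column, holding at most `chunksize` values in RAM."""
    reader = csv.DictReader(lines)
    total, chunk = 0.0, []
    for row in reader:
        chunk.append(float(row[column]))
        if len(chunk) >= chunksize:
            total += sum(chunk)  # reduce this chunk...
            chunk.clear()        # ...then free it before the next one
    return total + sum(chunk)
```

Passing an open file handle keeps peak memory bounded by `chunksize`, regardless of the file's total size.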




