NCI has added 120 Nvidia Kepler K80 GPUs (30 nodes) and 8 Nvidia Pascal P100 (NVLink) GPUs (2 nodes) to the Raijin cluster. This document describes how to get started using the GPUs. For further help with using the GPUs, please email email@example.com.
Each Kepler K80 compute node has four Nvidia K80 cards, two on each CPU socket. Each K80 card consists of two logical GPUs, so each node presents eight GPUs in total. GPUs 0-3 have affinity to CPU socket 0 and GPUs 4-7 have affinity to CPU socket 1. The logical view of a GPU node is shown below:
- 2 x 12-core Intel Haswell E5-2670v3 CPUs (2.3 GHz) in each of 14 nodes
- 2 x 14-core Intel Broadwell E5-2690v4 CPUs (2.6 GHz) in each of the other 16 nodes (jobs may run slightly faster on these nodes than the same code run on the Haswell nodes above)
- 4 x NVIDIA Tesla K80 Accelerator (or 8 GPUs) on each node
- 4992 NVIDIA CUDA cores per K80 (2496 per GPU)
- Up to 2.91 Teraflops double-precision theoretical peak performance per K80
- Up to 8.73 Teraflops single-precision theoretical peak performance per K80
- 24 GB of GDDR5 memory per K80 (12 GB per GPU), for a total of 96 GB of GPU memory per node, plus 128 GB (18 nodes) or 256 GB (12 nodes) of DDR4 CPU memory
- 700 GB of SSD local disk per node
- 480 GB/sec aggregate memory bandwidth per K80 (240 GB/sec per GPU)
- Topology configuration on the GPU nodes is shown below:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU
GPU0    X     PIX   PHB   PHB   SOC   SOC   SOC   SOC   CPU0
GPU1    PIX   X     PHB   PHB   SOC   SOC   SOC   SOC   CPU0
GPU2    PHB   PHB   X     PIX   SOC   SOC   SOC   SOC   CPU0
GPU3    PHB   PHB   PIX   X     SOC   SOC   SOC   SOC   CPU0
GPU4    SOC   SOC   SOC   SOC   X     PIX   PHB   PHB   CPU1
GPU5    SOC   SOC   SOC   SOC   PIX   X     PHB   PHB   CPU1
GPU6    SOC   SOC   SOC   SOC   PHB   PHB   X     PIX   CPU1
GPU7    SOC   SOC   SOC   SOC   PHB   PHB   PIX   X     CPU1
Legend:
X    = Self
SOC  = Path traverses a socket-level link (e.g. QPI)
PHB  = Path traverses a PCIe host bridge
PXB  = Path traverses multiple PCIe internal switches
PIX  = Path traverses a PCIe internal switch
CPU0 = Cores 0,2,4,...,22
CPU1 = Cores 1,3,5,...,23
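The topology matrix above can be reproduced on a GPU node with the nvidia-smi utility, which ships with the Nvidia driver. A minimal sketch (run inside a PBS job on a GPU node, not on a login node):

```shell
# Print the GPU/CPU interconnect topology matrix for the current node.
# Output matches the legend above (PIX, PHB, SOC, NV#, etc.).
nvidia-smi topo -m
```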
- More details on the Tesla K80 can be found at: http://www.nvidia.com/object/tesla-k80.html
Each Pascal P100 compute node has four Nvidia Pascal P100 GPUs. GPUs 0-1 have affinity to CPU socket 0 and GPUs 2-3 have affinity to CPU socket 1. The logical view of a GPU node is shown below:
- 2 x 12-core Intel Broadwell E5-2650 v4 CPUs (2.2 GHz) in each of the two gpupascal nodes.
- 4 x NVIDIA Pascal P100 SXM2 Accelerator (with NVLink between the GPUs) on each node.
- 56 SMs with 64 FP32 CUDA Cores per SM.
- 3584 FP32 CUDA Cores per GPU.
- Up to 5.3 Teraflops double-precision theoretical peak performance per P100.
- Up to 10.6 Teraflops single-precision theoretical peak performance per P100.
- 16 GB of CoWoS HBM2 memory per P100, for a total of 64 GB of GPU memory plus 128 GB of DDR4 CPU memory per node.
- 400 GB of SSD local disk per node.
- 732 GB/sec memory bandwidth per P100.
- Topology configuration on the GPU nodes is shown below:
GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity
GPU0 X NV1 NV1 NV2 PIX 0-11,24-35
GPU1 NV1 X NV2 NV1 PIX 0-11,24-35
GPU2 NV1 NV2 X NV1 SOC 12-23,36-47
GPU3 NV2 NV1 NV1 X SOC 12-23,36-47
mlx5_0 PIX PIX SOC SOC X
Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
More info: http://www.nvidia.com/object/tesla-p100.html
SU Charge Rate
NCI does not charge service units (SUs) for GPU usage itself; however, jobs are charged for the associated CPU usage, which on the K80 nodes is currently 3 SU per core-hour. To maintain optimal CPU-GPU affinity, users must request a minimum of 6 CPUs and 2 GPUs (one K80 card, which appears to the operating system as two GPUs) per job. Larger requests must be a multiple of 6 CPUs together with the corresponding multiple of 2 GPUs. For example, a 6-CPU, 2-GPU job running for 10 hours is charged 6 x 3 x 10 = 180 SU.
On the Pascal nodes, jobs are likewise charged only for the associated CPU usage, currently 4 SU per core-hour. To maintain optimal CPU-GPU affinity, users must request a minimum of 6 CPUs and 1 GPU per job, and larger requests must be a multiple of 6 CPUs for every additional GPU. A maximum of 48 CPUs with 8 GPUs in total is available.
Preparing Job Script
Access to the GPUs is through the Raijin PBS batch system. The following is a typical PBS job script which shows the minimum requirements for submitting a job with a GPU-enabled executable.
The -l ngpus flag specifies the number of GPUs that will be dedicated to the job. Requests for GPUs must be in multiples of 2; each pair of GPUs corresponds to one K80 card. For jobs requiring multiple nodes, all CPUs of those nodes must be requested, which means ncpus must be a multiple of 24 and ngpus the corresponding multiple of 8 (e.g., ncpus=48 and ngpus=16 for 2 nodes).
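The requirements above can be sketched as a job script. This is a minimal sketch only: the queue name (gpu), project code (abc123), memory, walltime, module version and program name are placeholders to adapt to your own allocation.

```shell
#!/bin/bash
# Minimal sketch of a K80 GPU job script (placeholders: queue name,
# project code, memory, walltime, module version, program name).
#PBS -q gpu
#PBS -P abc123
#PBS -l ncpus=6
#PBS -l ngpus=2
#PBS -l mem=32GB
#PBS -l walltime=02:00:00
#PBS -l wd

# Load the same CUDA module used to build the executable
module load cuda/7.5

# Run the GPU-enabled executable
./my_gpu_program > output.log
```

Submit with qsub as usual; ncpus and ngpus scale together in the 6:2 ratio described above.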
Once the job is running, you can use qps_gpu <jobid> to monitor the GPU utilisation.
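A quick sketch of the monitoring workflow (the job id 1234567 is a placeholder; use the id that qsub returned):

```shell
# List your queued and running jobs to find the job id
qstat -u $USER

# Show GPU utilisation for a running GPU job (NCI-provided wrapper)
qps_gpu 1234567
```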
The section below on benchmark performance may help you determine how many resources to request.
Compiling CUDA codes
The NVIDIA compiler is called nvcc, and it is provided by loading the cuda module. This module will also give access to the rest of the CUDA Toolkit.
On Raijin, we have four different versions of the CUDA module (6.5, 7.0, 7.5 and 8.0). cuda/6.5 is the default CUDA module.
The CUDA Toolkit is a software package with several components. The main components are:
GPU libraries (cuFFT, cuBLAS, cuSPARSE, cuSOLVER, cuRAND, NPP, Thrust and other math libraries);
Development tools (NVCC compiler, Nsight IDE, Visual profiler, CUDA-GDB debugger and memory analyser)
Reference materials (CUDA C/C++ code samples and documentation)
Note that users can compile their CUDA/OpenACC codes on the login nodes (e.g., raijin1) and then submit PBS jobs to run them on the GPU nodes.
The compute capability of the Kepler K80 is 3.7; however, compiling for 3.7 does not work here. Please use the following option during compilation:
-gencode arch=compute_35,code=sm_35
The compute capability of the Pascal P100 is 6.0; use the following option during compilation:
-gencode arch=compute_60,code=sm_60
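Putting the above together, a compilation on a login node might look like the following sketch (prog.cu and the output names are placeholders; cuda/8.0 is assumed here because Pascal sm_60 support requires CUDA 8 or later):

```shell
# Load a CUDA toolkit that supports both targets (CUDA 8+ for sm_60)
module load cuda/8.0

# Build for the Kepler K80 nodes (sm_35, per the note above)
nvcc -O3 -gencode arch=compute_35,code=sm_35 -o prog_k80 prog.cu

# Build for the Pascal P100 nodes (sm_60)
nvcc -O3 -gencode arch=compute_60,code=sm_60 -o prog_p100 prog.cu
```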
Available GPU Software
We currently have the following GPU-enabled programs compiled and available on Raijin:
namd (2.12, 2.11 and 2.10)
- namd2-gpu (openmpi)
- namd2-node-gpu (multicore)
vasp (please turn on LREAL=Auto; large systems do not work, as VASP tries to allocate the FFT grid in GPU memory)
Please see the VASP documentation for more details on running VASP on the GPUs.
gromacs (5.1.0-gpu and 5.1.2-gpu: gmx, gmx_mpi)
lammps/7Dec15-gpu (lmp_gpu, double precision)
amber/16-16.05 (pmemd.cuda for serial and pmemd.cuda.MPI for parallel)
- SPFP - (Default) Uses a combination of single precision for calculation and fixed (integer) precision for accumulation. This approach is believed to provide the optimum tradeoff between accuracy and performance and hence at the time of release is the default model invoked when using the executable pmemd.cuda.
- DPFP - Uses double precision (and double precision equivalent fixed precision) for the entire calculation. This provides for careful regression testing against the CPU code. It makes no additional approximations above and beyond the CPU implementation and would be the model of choice if performance was not a consideration. On NCI's NVIDIA hardware, the performance is substantially less than the SPFP model.
- SPXP - (Experimental) Uses single precision for calculation and a combination of 32 bit integer accumulation strategies to approximate 48 bit precision in the summation stage. This precision model has been designed to provide future proofing of performance on next and later generation hardware designs. It is considered experimental at present and should not be used for production simulations except as a way to test how the model performs.
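The precision models above are selected by running different executables. A sketch of the usual Amber 16 naming convention is below; the suffixed executable names are an assumption from the standard Amber build, so check what the module on Raijin actually provides. Input file names are placeholders.

```shell
module load amber/16-16.05

# Default executable uses the SPFP precision model
pmemd.cuda      -O -i mdin -p prmtop -c inpcrd -o mdout_spfp

# Full double precision (DPFP), e.g. for regression testing against CPU runs
# (executable name assumed from the standard Amber 16 build)
pmemd.cuda_DPFP -O -i mdin -p prmtop -c inpcrd -o mdout_dpfp
```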
tensorflow/0.8 and 0.9.0
theano/0.8.1-3.4.3 and 0.9.0.dev2-3.4.3
GPU Benchmark Results
Each unit of K80 in the benchmark graphs uses one K80 card (2 GPUs) with 6 Intel Haswell CPUs. Each Raijin node uses 16 Intel Sandybridge CPUs.
NAMD: Problem size (1M atoms, STMV)
- NAMD 2.11 was used on CPU and 2.11 multicore on GPU, except the 48-CPU job, which ran NAMD 2.10 + openmpi/1.8.8.
- The multicore build runs about 20% faster than the MPI version (0.62 days/ns vs 0.76 days/ns) on 24 CPUs.
- 2.11 runs 10% faster than 2.10 on cpus (0.59 days/ns vs 0.62 days/ns).
GROMACS: Problem size (136K atoms, adh_cubic_vsites)
- Gromacs 5.1.0 was compiled with cuda/7.5
LAMMPS: Problem size (32*32*64 grid, FERMI, lj)
- lammps 7Dec15 was compiled with double precision, and cuda/7.5
AMBER: Problem size (1M atom, STMV, PME 4000 steps, NPT 4fs )
- Amber 16-16.05 was used on both cpu and gpu + openmpi/1.8.8
- PME explicit Solvent, STMV NPT HMR 4fs = 1,067,095 atoms
- The 2-K80 job achieves 4.90 ns/day, similar to 256 CPUs at 4.20 ns/day (higher is better).
- Using more K80s slows the calculation down; choose your ngpus carefully.
nvBowtie/bowtie2: Problem size (human_g1k_v37.fasta as ref for index. SRR077487_1 and SRR077487_2 for input)
- Both nvBowtie and bowtie2 are only supported on a single node.
- Reference file (fasta) is from:
- Reads (fastq) are from
- For paired end files, the _1 and _2 files are from.
More GPU benchmark results can be found on Nvidia's website.