
NCI has added 120 Nvidia Kepler K80 GPUs (30 nodes) and 8 Nvidia Pascal P100 (NVLink) GPUs (2 nodes) to the Raijin cluster. This document describes how to get started using the GPUs. For further help with using the GPUs, please email help@nci.org.au.

System Configuration

Kepler K80:

Each Kepler K80 compute node has four Nvidia K80 cards, two attached to each CPU socket. Each K80 card consists of two logical GPUs, so each node presents eight GPUs in total. GPUs 0-3 have affinity to CPU socket 0 and GPUs 4-7 have affinity to CPU socket 1. The logical view of a GPU node is shown below:

 

  • 2 x 12-core Intel Haswell E5-2670v3 CPUs (2.3 GHz) in each of 14 nodes
  • 2 x 14-core Intel Broadwell E5-2690v4 CPUs (2.6 GHz) in each of the other 16 nodes (the same code may run slightly faster on these nodes than on the Haswell nodes above)
  • 4 x NVIDIA Tesla K80 Accelerator (or 8 GPUs) on each node
  • 4992 NVIDIA CUDA cores per K80 (2496 per GPU)
  • Up to 2.91 Teraflops double-precision theoretical peak performance per K80
  • Up to 8.73 Teraflops single-precision theoretical peak performance per K80
  • 24 GB of GDDR5 memory per K80 (12 GB per GPU), giving 96 GB of GPU memory per node, plus 128 GB (18 nodes) or 256 GB (12 nodes) of DDR4 CPU memory
  • 700 GB of SSD local disk per node
  • 480 GB/sec aggregate memory bandwidth per K80 (240 GB/sec per GPU)
  • Topology configuration on the GPU nodes is shown below:
nvidia-smi topo --matrix
          GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU
     GPU0  X   PIX  PHB  PHB  SOC  SOC  SOC  SOC  CPU0
     GPU1 PIX   X   PHB  PHB  SOC  SOC  SOC  SOC  CPU0
     GPU2 PHB  PHB   X   PIX  SOC  SOC  SOC  SOC  CPU0
     GPU3 PHB  PHB  PIX   X   SOC  SOC  SOC  SOC  CPU0
     GPU4 SOC  SOC  SOC  SOC   X   PIX  PHB  PHB  CPU1
     GPU5 SOC  SOC  SOC  SOC  PIX   X   PHB  PHB  CPU1
     GPU6 SOC  SOC  SOC  SOC  PHB  PHB   X   PIX  CPU1
     GPU7 SOC  SOC  SOC  SOC  PHB  PHB  PIX   X   CPU1
 
     Legend:
      X    = Self
      SOC  = Path traverses a socket-level link (e.g. QPI)
      PHB  = Path traverses a PCIe host bridge
      PXB  = Path traverses multiple PCIe internal switches
      PIX  = Path traverses a PCIe internal switch
      CPU0 = Core 0,2,4,...22
      CPU1 = Core 1,3,5,...23

More details on the Tesla K80 are available at: http://www.nvidia.com/object/tesla-k80.html

Pascal P100:

Each Pascal P100 compute node has four Nvidia Pascal P100 GPUs. GPUs 0-1 have affinity to CPU socket 0 and GPUs 2-3 have affinity to CPU socket 1. The logical view of a GPU node is shown below:

  • 2 x 12-core Intel Broadwell E5-2650 v4 CPUs (2.2 GHz) in each of the two gpupascal nodes. 
  • 4 x NVIDIA Pascal P100 SXM2 Accelerator (with NVLink between the GPUs) on each node.
  • 56 SMs with 64 FP32 CUDA Cores per SM.
  • 3584 FP32 CUDA Cores per GPU.
  • Up to 5.3 Teraflops double-precision theoretical peak performance per P100.
  • Up to 10.6 Teraflops single-precision theoretical peak performance per P100.
  • 16 GB of CoWoS HBM2 memory per P100, giving 64 GB of GPU memory per node, plus 128 GB of DDR4 CPU memory.
  • 400 GB of SSD local disk per node.
  • 732 GB/sec memory bandwidth.
  • Topology configuration on the GPU nodes is shown below:
nvidia-smi topo --matrix
        GPU0   GPU1   GPU2   GPU3   mlx5_0   CPU Affinity
GPU0     X     NV1    NV1    NV2    PIX      0-11,24-35
GPU1    NV1     X     NV2    NV1    PIX      0-11,24-35
GPU2    NV1    NV2     X     NV1    SOC      12-23,36-47
GPU3    NV2    NV1    NV1     X     SOC      12-23,36-47
mlx5_0  PIX    PIX    SOC    SOC     X


Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

More info: http://www.nvidia.com/object/tesla-p100.html

To use these Pascal nodes, your application must be built against CUDA 8.0 (or newer).

SU Charge Rate

Kepler K80:

NCI does not charge service units (SUs) for GPU usage; however, jobs are charged for the associated CPU usage, which is currently 3 SU per core-hour. To maintain optimal CPU-GPU affinity, the minimum request is 6 CPUs and 2 GPUs (one K80 card, which appears to the operating system as 2 GPUs) per job. Larger jobs must request CPUs in multiples of 6 and GPUs in multiples of 2, keeping 3 CPUs per GPU.
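
For example, a request for two K80 cards (4 GPUs with 12 CPUs) would contain the following resource lines (other limits such as walltime and memory are omitted from this sketch):

#PBS -q gpu
#PBS -l ngpus=4
#PBS -l ncpus=12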

Pascal P100:

NCI does not charge service units (SUs) for GPU usage; however, jobs are charged for the associated CPU usage, which is currently 4 SU per core-hour. To maintain optimal CPU-GPU affinity, the minimum request is 6 CPUs and 1 GPU per job. Larger jobs must request 6 CPUs for every GPU, up to a maximum of 48 CPUs and 8 GPUs in total.
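
For example, the largest possible request on the gpupascal queue (both P100 nodes) would contain the following resource lines (other limits omitted from this sketch):

#PBS -q gpupascal
#PBS -l ngpus=8
#PBS -l ncpus=48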

Preparing Job Script

Access to the GPUs is through the Raijin PBS batch system. The following sample PBS job scripts show the minimum requirements for submitting a job with a GPU-enabled executable.

Kepler K80:

#!/bin/bash
#PBS -q gpu
#PBS -l ngpus=2
#minimum ngpus request is 2, must be a multiple of 2.
#PBS -l ncpus=6 
#minimum ncpus request is 6, must be a multiple of 6, and 3 x ngpus
... # Other PBS resource request 
 
PATH_TO_GPU_EXECUTABLE > output_file

The -l ngpus flag specifies the number of GPUs that will be dedicated to the job. Requests for GPUs must be in multiples of 2; each pair of GPUs corresponds to one K80 card. For jobs requiring multiple nodes, all CPUs of those nodes must be requested, which means that ncpus must be a multiple of 24 and ngpus must be the corresponding multiple of 8, e.g., for 2 nodes:

#PBS -lngpus=16
#PBS -lncpus=48

Pascal P100:

#!/bin/bash
#PBS -q gpupascal
#PBS -l ngpus=1
#PBS -l ncpus=6 
... # Other PBS resource request 
 
PATH_TO_GPU_EXECUTABLE_COMPILED_WITH_CUDA_8+ > output_file
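
In either case, submit the script with the standard PBS commands (the script name gpu_job.sh below is just a placeholder):

qsub gpu_job.sh
qstat -u $USER    # check the job's status in the queue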

Once the job is running, you can use qps_gpu <jobid> to monitor the GPU utilisation.
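
If you prefer to log utilisation from within the job script itself, a minimal sketch is shown below (assuming nvidia-smi is available on the GPU nodes, which it is as part of the NVIDIA driver; the log file name gpu_usage.log is just a placeholder):

# Record per-GPU utilisation and memory use every 60 seconds in the background
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 60 > gpu_usage.log &
SMI_PID=$!

PATH_TO_GPU_EXECUTABLE > output_file

# Stop the background monitor once the executable has finished
kill $SMI_PID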

 


The benchmark results below may help you determine how many resources to request.

 

Compiling CUDA codes

module load cuda
nvcc -o executable source.cu -lcuda -lcudart

The NVIDIA compiler is called nvcc, and it is provided by loading the cuda module. This module will also give access to the rest of the CUDA Toolkit.

On Raijin, we have four different versions of the CUDA module (6.5, 7.0, 7.5 and 8.0); cuda/6.5 is the default.
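
For example, to see the installed versions and load one explicitly instead of the default (CUDA 8.0 is needed when targeting the Pascal P100 nodes, as noted above):

module avail cuda      # list the installed CUDA toolkit versions
module load cuda/8.0   # load CUDA 8.0 explicitly, e.g. for the Pascal P100 nodes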

The CUDA Toolkit is a software package with several components. The main ones are:

  • GPU libraries (cuFFT, cuBLAS, cuSPARSE, cuSOLVER, cuRAND, NPP, Thrust and other math libraries);
  • Development tools (the NVCC compiler, Nsight IDE, Visual Profiler, CUDA-GDB debugger and memory analyser);
  • Reference materials (CUDA C/C++ code samples and documentation).

Note that users can compile their CUDA/OpenACC codes on the login nodes (e.g., raijin1) and then submit PBS jobs to run them on the GPU nodes.


The compute capability of the Kepler K80 is 3.7; however, targeting 3.7 directly does not work. Please use the following option during compilation: -gencode arch=compute_35,code=sm_35. The compute capability of the Pascal P100 is 6.0; use the following option during compilation: -gencode arch=compute_60,code=sm_60.
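
Putting this together, a sketch of a compile command that produces a single executable usable on both GPU types (source.cu and executable are placeholders from the example above; CUDA 8.0 is assumed since compute_60 requires it):

module load cuda/8.0
nvcc -o executable source.cu -lcuda -lcudart \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_60,code=sm_60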

https://developer.nvidia.com/cuda-gpus


Available GPU Software

We currently have the following GPU-enabled programs available on Raijin:

Chemistry

namd (2.12, 2.11 and 2.10)

  • namd2-gpu (openmpi) 
module load namd/2.12
mpirun namd2-gpu input > output
  • namd2-node-gpu (multicore)
module load namd/2.12
namd2-node-gpu +p ${PBS_NCPUS} input > output

vasp/5.4.1

Please turn on LREAL=Auto. Large systems do not work, as VASP tries to allocate the FFT grid in GPU memory.

  • vasp_ncl-gpu
  • vasp_std-gpu
  • Please read here for more details on running VASP, in particular the section on nvidia-cuda-mps-control

    # Start the CUDA Multi-Process Service (MPS) so that multiple MPI ranks
    # can share the GPUs; use job-local directories for its pipes and logs
    mkdir $PBS_JOBFS/nvidia-mps
    export CUDA_MPS_PIPE_DIRECTORY=$PBS_JOBFS/nvidia-mps
    mkdir $PBS_JOBFS/nvidia-log
    export CUDA_MPS_LOG_DIRECTORY=$PBS_JOBFS/nvidia-log
    nvidia-cuda-mps-control -d

    module load vasp/5.4.1
    mpirun vasp_std-gpu > vasp.out

gromacs (5.1.0-gpu and 5.1.2-gpu: gmx, gmx_mpi)

module load gromacs/5.1.2-gpu
mpirun gmx_mpi mdrun ...

lammps/7Dec15-gpu (lmp_gpu, double precision)

module load lammps/7Dec15-gpu
# Pass at most 8 GPUs to the gpu package (a K80 node has 8 logical GPUs)
ngpus=$(( PBS_NGPUS<8?PBS_NGPUS:8 ))
mpirun -np $PBS_NCPUS lmp_gpu -sf gpu -pk gpu $ngpus < input > output

amber/16-16.05 (pmemd.cuda for serial and pmemd.cuda.MPI for parallel)

module load cuda/7.5
module load intel-cc/16.0.3.210
module load intel-fc/16.0.3.210
module load openmpi/1.8.8
module load amber/16-16.05

mpirun -np $PBS_NGPUS $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin -o mdout -inf mdinfo -x mdcrd -r restrt


# Make sure to use the pmemd.cuda.XXXX.MPI executable when using GPU and MPI, where XXXX=SPFP, DPFP, or SPXP.
  • SPFP - (Default) Uses a combination of single precision for calculation and fixed (integer) precision for accumulation. This approach is believed to provide the optimum tradeoff between accuracy and performance and hence at the time of release is the default model invoked when using the executable pmemd.cuda.
     
  • DPFP - Uses double precision (and double precision equivalent fixed precision) for the entire calculation. This provides for careful regression testing against the CPU code. It makes no additional approximations above and beyond the CPU implementation and would be the model of choice if performance were not a consideration (a run sketch follows this list). On NCI's NVIDIA hardware, the performance is substantially lower than with the SPFP model.
     
  • SPXP - (Experimental) Uses single precision for calculation and a combination of 32 bit integer accumulation strategies to approximate 48 bit precision in the summation stage. This precision model has been designed to provide future proofing of performance on next and later generation hardware designs. It is considered experimental at present and should not be used for production simulations except as a way to test how the model performs.
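
For example, following the executable naming pattern given above and after loading the same modules as in the earlier example, a double-precision regression run could look like this sketch:

module load amber/16-16.05
# DPFP build: full double precision, mainly for regression testing against the CPU code
mpirun -np $PBS_NGPUS $AMBERHOME/bin/pmemd.cuda.DPFP.MPI -O -i mdin -o mdout -inf mdinfo -x mdcrd -r restrt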

Bioinformatics

  • nvbio/1.1

  • nextgenmap/0.5.0

Deep Learning

  • tensorflow/0.8 and 0.9.0

  • theano/0.8.1-3.4.3 and 0.9.0.dev2-3.4.3

  • lasagne/0.2.dev1-3.4.3

  • cudnn/5.1.3-cuda7.5

  • cudnn/5.1.10-cuda8.0
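
A minimal sketch of using one of these modules inside a GPU job; train.py is a hypothetical user script, and whether companion modules (e.g. cudnn) must be loaded explicitly depends on the module's own dependency handling:

module load tensorflow/0.9.0
python train.py > train.log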


Others

  • Matlab

  • openmpi/cuda/7.5/1.10.2*

GPU Benchmark Results

Each K80 unit in the benchmark graphs uses one K80 card (2 GPUs) with 6 Intel Haswell CPUs. Each Raijin node uses 16 Intel Sandy Bridge CPUs.

 

NAMD: Problem size (1M atoms, STMV)

  • NAMD 2.11 was used on CPU and NAMD 2.11 (multicore) on GPU, except for the 48-CPU job, which ran NAMD 2.10 + openmpi/1.8.8.
  • The multicore build runs about 20% faster than the MPI version on 24 CPUs (0.62 days/ns vs 0.76 days/ns).
  • NAMD 2.11 runs about 10% faster than 2.10 on CPUs (0.59 days/ns vs 0.62 days/ns).

GROMACS: Problem size (136K atoms, adh_cubic_vsites)

  • Gromacs 5.1.0 was compiled with cuda/7.5

LAMMPS: Problem size (32*32*64 grid, FERMI, lj)

  • lammps 7Dec15 was compiled with double precision, and cuda/7.5

AMBER: Problem size (1M atoms, STMV, PME 4000 steps, NPT 4fs)

  • Amber 16-16.05 was used on both cpu and gpu + openmpi/1.8.8
  • PME explicit Solvent, STMV NPT HMR 4fs =  1,067,095 atoms
  • The 2-K80 job achieves 4.90 ns/day, similar to 256 CPUs at 4.20 ns/day (larger is better).
  • Using more K80s slows the calculation down, so choose your ngpus carefully.

nvBowtie/bowtie2: Problem size (human_g1k_v37.fasta as ref for index. SRR077487_1 and SRR077487_2 for input)

More GPU benchmark results can be found on Nvidia's website.