
NCI has added 32 nodes, each housing an Intel second-generation Xeon Phi processor (Knights Landing, or KNL). This document is a getting-started guide to using the KNL nodes. For further help please email help@nci.org.au.

System Configuration

Each KNL node consists of:

  • 1 x 64-core (256 threads with hyperthreading) Intel Xeon Phi 7230 CPU with a base clock speed of 1.30 GHz (32 double-precision FLOPs/cycle), providing a peak performance of 2.6624 TFLOPs (85.2 TFLOPs in total across all 32 nodes)
  • 192 GB DDR4-2400 RAM, with a maximum memory bandwidth of 115.2 GB/s
  • 16 GB of on-package high-bandwidth MCDRAM (~380 GB/s), used as an L3 cache for the DDR4 RAM
  • 400 GB SSD local disk

  • 100 Gb/s InfiniBand interconnect between KNL nodes, shared by the 64 cores on a node (it is therefore easy to saturate this bandwidth if the communication pattern is poor)
  • 56 Gb/s InfiniBand interconnect from the KNL nodes to Raijin storage

The KNL nodes run CentOS 6.8 with the Linux 3.10.0 kernel, and they support the Intel AVX-512 instruction set extensions.
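A quick (unofficial) way to confirm this on a KNL node is to look at the CPU feature flags reported by the kernel; the AVX-512 flags should be listed:

grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u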

SU Charge Rate

Jobs running on the KNL nodes are charged at 0.25 SU per core-hour, i.e. 16 SUs per hour per KNL node (64 cores). The per-node charge rate is therefore the same as for a normal Raijin node (16 cores).
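For example, a job that uses 4 KNL nodes (256 cores) for 10 walltime hours is charged 256 x 0.25 x 10 = 640 SUs.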

Compiling Code for KNL

Since the KNL nodes are binary compatible with the legacy x86 instruction set, any code compiled for normal Raijin compute nodes will run on them. However, specific compiler options are needed to generate AVX-512 instructions and obtain better performance from these nodes.

Intel Compilers

Version 15.0 and newer of the Intel compilers can generate these instructions if you specify the -xMIC-AVX512 flag:

  • For Intel C/C++ compilers:
module load intel-cc/17.0.0.098
icc -xMIC-AVX512 -O3 -o executable source.c
icpc -xMIC-AVX512 -O3 -o executable source.cc
  • For Intel Fortran compiler:
module load intel-fc/17.0.0.098
ifort -xMIC-AVX512 -O3 -o executable source.f

GNU Compilers

While version 4.9 and newer of the GNU compilers can also generate AVX-512 instructions with the appropriate compiler flags, they rely on the system assembler to convert the assembly into machine code. Unfortunately, the default assembler on Raijin is too old to understand these instructions, and will likely fail with errors such as "no such instruction" and "invalid suffix or operands". The Intel compilers don't suffer from this as they use their own built-in assembler. If you must use the GNU compilers and want to generate AVX-512 instructions, you'll need to use a newer assembler (available by loading the binutils module) as well as one of the newer compilers.

  • For GNU C/C++ compilers:
module load gcc/5.2.0 binutils/2.25
gcc -mavx512f -mavx512er -mavx512cd -mavx512pf -O3 -o executable source.c
g++ -mavx512f -mavx512er -mavx512cd -mavx512pf -O3 -o executable source.cc
  • For GNU Fortran:
module load gcc/5.2.0 binutils/2.25
gfortran -mavx512f -mavx512er -mavx512cd -mavx512pf -O3 -o executable source.f
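If you are not sure which assembler the compilers will invoke, one quick check after loading the binutils module is:

module load binutils/2.25
as --version | head -1

The first line of the output should report binutils 2.25 rather than the older system assembler.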

Math Libraries

If your code uses functions in BLAS, LAPACK, FFT or other function domains supported by the Intel Math Kernel Library (MKL), it is recommended to call the MKL versions of these functions, since MKL is tuned for performance on the KNL nodes. To use MKL, load the intel-mkl/16.0.3.210 module and link against the MKL libraries. The Intel MKL Link Line Advisor can help determine the appropriate compiler and linker options.
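As a minimal sketch, assuming the Intel C compiler and that the simple -mkl shorthand is sufficient for your case (use the Link Line Advisor for an explicit link line or a different threading layer):

module load intel-cc/17.0.0.098 intel-mkl/16.0.3.210
# -mkl=sequential links against the single-threaded MKL libraries
icc -xMIC-AVX512 -O3 -mkl=sequential -o executable source.c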

Preparing PBS Job Script

The KNL nodes are accessible via the PBS "knl" queue. Jobs need to request a multiple of 64 CPUs, i.e. whole KNL nodes. To request a single KNL node with a certain amount of memory (say 64GB), the following PBS job script can be used:

#!/bin/bash
#PBS -q knl
#PBS -l ncpus=64
#PBS -l other=hyperthread
#PBS -l mem=64GB
#PBS -l wd
... # Other PBS resource requests
 
PATH_TO_EXECUTABLE > output_file

Note the "-l other=hyperthread" option. This ensures the job runs on all hardware threads of the KNL node, i.e. it would effectively use 4 x 64 = 256 threads in this case. Without this option, the job would use only one hardware thread per core and would therefore run only 64 threads. For many applications we observed better performance when oversubscribing with hyperthreading.

In order to request more than one KNL node, specify a multiple of 64 cores. For example, if 4 nodes are required, specify "#PBS -l ncpus=256" in the above script.
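A minimal sketch of such a 4-node job, launching one MPI rank per core without hyperthreading and assuming 64GB of memory per node (i.e. 256GB in total); adjust the resource requests and launch line to suit your application:

#!/bin/bash
#PBS -q knl
#PBS -l ncpus=256
#PBS -l mem=256GB
#PBS -l wd

mpirun -np $PBS_NCPUS PATH_TO_EXECUTABLE > output_file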

Available Software

As stated above, code running on normal Raijin compute nodes should run directly on the KNL nodes. However, for better performance a few commonly used applications have been rebuilt for these nodes. 

NAMD (2.11)

  • namd2-knl (openmpi)

          Using hyperthreading:

#PBS -l ncpus=64
#PBS -l other=hyperthread
...
module load namd/2.11
mpirun -np 256 --report-bindings --oversubscribe --map-by hwthread namd2-knl input > output
  • namd2-node-knl (openmp)

          Using hyperthreading:

#PBS -l ncpus=64
#PBS -l other=hyperthread
...
export OMP_NUM_THREADS=256
module load namd/2.11
namd2-node-knl +p 256 input > output

Quantum Espresso (5.4.0-knl)

module load espresso/5.4.0-knl
mpirun pw.x -input input > output 

VASP (5.4.1)

  • vasp_std-knl
  • vasp_gam-knl
  • vasp_ncl-knl
module load vasp/5.4.1
mpirun -np $PBS_NCPUS vasp_executable > vasp.out

Gromacs (5.1.3-knl)

  • gmx
  • gmx_mpi
module load gromacs/5.1.3-knl
mpirun gmx_mpi mdrun ...

Benchmark Results

NAMD (problem size: 1M atoms, STMV)

  • NAMD 2.11 + openmpi/1.10.0
  • One KNL node with openmpi and hyperthreading gives a 1.36x speedup compared with a Raijin compute node.

Quantum Espresso

  • Only PWSCF (Plane-Wave Self-Consistent Field) is available for now.
  • One KNL node gives a 1.48x speedup compared with a Raijin compute node.
  • Do not use hyperthreading: it runs slower than without hyperthreading.

VASP

  • One KNL node gives a 3.79x speedup compared with one Raijin compute node.
  • VASP scales super-linearly: 4 Raijin nodes (64 CPUs) with MPI give a 5.75x speedup compared with 1 Raijin node. 4 Raijin nodes are also a 1.51x speedup over 1 KNL node (which likewise has 64 CPUs); however, 4 Raijin nodes cost 4 times as much as 1 KNL node (64 SUs per hour vs 16 SUs per hour).
  • The value of NCORE should also be tuned to get the best performance (see the sketch after this list).
  • Do not use hyperthreading: it runs slower than without hyperthreading.
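A purely illustrative sketch of setting NCORE before a run; the value below is a hypothetical starting point, not a recommendation, and should be tuned for your own system and core count:

# Append a trial NCORE value to the VASP INCAR file (adjust per system).
cat >> INCAR <<'EOF'
NCORE = 8
EOF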

Gromacs

  • One KNL node gives a 1.39x speedup compared with one Raijin compute node.
  • The tests were run without hyperthreading.

Optimizing Code for KNL - Vectorization 

There are certain considerations to take into account before running legacy codes on the KNL nodes. In particular, the effective use of vector instructions is critical to achieving good performance on KNL cores. For guidance on how to obtain vectorization information and improve code vectorization, refer to How to Improve Code Vectorization.
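As a starting point, the Intel compilers can produce a per-source vectorization report. The sketch below assumes the Intel C compiler; the same -qopt-report options are accepted by icpc and ifort:

module load intel-cc/17.0.0.098
# Writes source.optrpt describing which loops were vectorized and why others were not.
icc -xMIC-AVX512 -O3 -qopt-report=5 -qopt-report-phase=vec -o executable source.c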