Vectorization, a form of data-level parallelism, is the process of converting code from operating on a single value at a time to operating on a set of values (a vector) at once, to speed up execution. Modern CPUs provide direct support for vector operations, where a single instruction is applied to multiple data (SIMD). The processors on Gadi provide the following vectorization registers and instruction sets:
To get the most performance out of these processors, users should take advantage of these vector registers and instructions, and try to improve the usage of vectorization in their code. This document provides a guideline on how to get vectorization information and improve code vectorization.
Only loops that meet the following conditions can be safely vectorized:
A loop that contains, for example, A[i] += A[i-1] (called read-after-write) cannot be vectorized, as each iteration reads data that was changed by the previous iteration. When the code is compiled with '-qopt-report -qopt-report-phase=vec', the vectorization report contains the following:
LOOP BEGIN at example.c(13,4)
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between a[i] (14:8) and a[i-1] (14:8)
LOOP END
However, a loop with A[i] += A[i+1] (called write-after-read) can be safely vectorized, as each iteration reads ahead into the data of the next iteration, which has not yet been changed.
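The two cases above can be sketched as a pair of loops. This is an illustrative example; the function names prefix_sum and shift_add are our own, not from the original text:

```c
/* Read-after-write: iteration i reads a[i-1], which was written by the
   previous iteration, so the compiler cannot vectorize this loop. */
void prefix_sum(int *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* Write-after-read: iteration i reads a[i+1], which has not yet been
   modified, so this loop can be safely vectorized. */
void shift_add(int *a, int n) {
    for (int i = 0; i < n - 1; i++)
        a[i] += a[i + 1];
}
```

Compiling with '-qopt-report -qopt-report-phase=vec' should report a FLOW dependence for the first loop and 'LOOP WAS VECTORIZED' for the second.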
The loop count must be known at entry to the loop at runtime, though not necessarily at compile time; that is, it can be a variable, but it must remain constant for the duration of the loop.
Loops with data-dependent exit cannot be vectorized.
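A typical data-dependent exit is a search loop that breaks as soon as a condition is met; the trip count depends on the data, so it is unknown at loop entry. A minimal sketch (the function name is our own, for illustration):

```c
/* The loop exits on the first negative value, so its trip count is
   data-dependent and unknown at loop entry: not vectorizable as written. */
int find_first_negative(const int *a, int n) {
    int i;
    for (i = 0; i < n; i++) {
        if (a[i] < 0)
            break;          /* data-dependent exit */
    }
    return i;               /* returns n if no negative value is found */
}
```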
Loops with iterations having different flow control cannot be vectorized, i.e., they must not branch.
Loops with function calls usually cannot be vectorized. Exceptions are intrinsic math functions and functions that can be inlined.
If the data accessed in consecutive iterations of a loop are not adjacent in memory, as in loops with indirect addressing or non-unit stride, the elements must be loaded separately using multiple instructions. Compilers rarely vectorize such loops unless the amount of computational work is large enough compared to the overhead of non-contiguous memory access.
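Both access patterns can be sketched as follows; the function names are illustrative, not from the original text:

```c
/* Indirect addressing: a[idx[i]] touches non-adjacent memory locations,
   so the elements cannot be fetched with a single contiguous vector load
   (the compiler would need gather instructions). */
float gather_sum(const float *a, const int *idx, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[idx[i]];
    return s;
}

/* Non-unit stride: consecutive iterations access every other element,
   so again the loads are not contiguous in memory. */
float strided_sum(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i += 2)
        s += a[i];
    return s;
}
```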
There are various ways to get information regarding how a code is vectorized. The following information is for Intel compilers. For GCC compilers please refer to the corresponding man page or documentation.
Compiler option '-qopt-report=5' can be used to generate an optimization report, which contains vectorization information. To generate a report for vectorization only, use '-qopt-report -qopt-report-phase=vec'.
For example, for the following sample C code fragment in a file named example.c:
for (i=0; i<MAX; i++) c[i]=a[i]+b[i];
When the file is compiled with 'icc -qopt-report -qopt-report-phase=vec', a report file named example.optrpt will contain the following vectorization report:
LOOP BEGIN at example.c(10,4)
remark #15300: LOOP WAS VECTORIZED
LOOP END
When compiled with '-qopt-report=5', a more detailed report will be generated in the example.optrpt report file:
Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at example.c(10,4)
remark #15388: vectorization support: reference c[i] has aligned access [ example.c(11,8) ]
remark #15388: vectorization support: reference a[i] has aligned access [ example.c(11,13) ]
remark #15388: vectorization support: reference b[i] has aligned access [ example.c(11,18) ]
remark #15305: vectorization support: vector length 4
remark #15399: vectorization support: unroll factor set to 2
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 2
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 6
remark #15477: vector cost: 1.250
remark #15478: estimated potential speedup: 2.350
remark #15488: --- end vector cost summary ---
remark #25015: Estimate of max trip count of loop=1
LOOP END
Intel Advisor has a feature called Vectorization Advisor that helps analyse existing vectorization, detect 'hot' un-vectorized or under-vectorized loops, and provide advice on improving the use of vectorization.
For details of how to use Intel Advisor and its Vectorization Advisor, refer to Intel® Advisor Tutorial for Adding Efficient SIMD Parallelism to C++ Code Using the Vectorization Advisor for Linux.
Intel Advisor provides both command line and GUI tools, called advixe-cl and advixe-gui respectively. To use Intel Advisor to get vectorization information, users can compile and run their code in the following steps:
1) Compile and link Fortran/C/C++ program using corresponding Intel compiler with the vectorization option:
Example:
$ module load intel-compiler/<version>
$ icc -g -qopenmp -O2 ./example.c -o ./example
Some Intel compiler options are listed below:
Compiler option | Function
---|---
-g | Build the application with debug information to allow binary-to-source correlation in the reports
-qopenmp | Enable generation of multi-threaded code if OpenMP directives/pragmas exist
-O2 (or higher) | Request compiler optimization
-vec | Enable vectorization if -O2 or higher is in effect (enabled by default)
-simd | Enable SIMD directives/pragmas (enabled by default)
For details of these options refer to man page or documentations of Intel compilers.
2) Submit a PBS job which executes the binary and runs the Intel Advisor command line tool advixe-cl to collect vectorization information:
Example (PBS scripting part is omitted):
...
$ module load intel-advisor/<version>
$ advixe-cl --collect survey --project-dir ./advi ./example
For a 4-process MPI program, collect survey data into the shared ./advi project directory:
...
$ module load intel-advisor/<version>
$ mpirun -n 4 advixe-cl --collect survey --project-dir ./advi ./mpi_example
3) Once the job finishes, launch advixe-gui on a login node to visualize the data collected by advixe-cl:
Example:
$ module load intel-advisor/<version>
$ advixe-gui &
Below is a screenshot of the Survey Report:
If a loop cannot be vectorized automatically, Intel Advisor will provide the reason and advice on how to fix the vectorization issues specific to your code, for example via its dependency analysis and memory access pattern analysis. Users should follow this advice, modify their source code, and give the compiler more hints to improve vectorization, by using compiler options or adding directives/pragmas to the source code (explicit vectorization).
The compiler option '-S' can be used to generate assembly code instead of binary. In the assembly code, SSE vector instructions generally operate on xmm registers, AVX and AVX2 on ymm registers, and AVX512 on zmm registers. AVX, AVX2, and AVX512 instructions are prefixed with 'v'.
Example of compiling a C program to generate assembly code instead of binary code:
$ icc -S -xHost -qopt-zmm-usage=high -o example.s example.c
(The '-qopt-zmm-usage=high' option above tells the compiler to generate zmm (AVX512) code without restriction. Note that it is used here for illustration only; this option may reduce the performance of your code.)
Example of generated vectorized assembly code:
vmovdqu32 %zmm1, (%rsi,%rdi,4)      #11.8
vmovdqu32 %zmm0, 64(%rsi,%rdi,4)    #11.8
vpaddd    %zmm2, %zmm1, %zmm1       #11.15
vpaddd    %zmm2, %zmm0, %zmm0       #11.15
Vectorization can be implemented either automatically by the compiler without any extra code, or explicitly by adding directives, pragmas, macros, etc. Automatic vectorization is enabled when a program is compiled using -O2 or higher options. Users can also specify '-xCORE-AVX512', '-xCOMMON-AVX512', or '-qopt-zmm-usage=high' flags, which tell the compiler to generate AVX512 vectorization instructions. Another option is to use the '-xHost' flag to tell the compiler to generate the highest level of vectorization supported on the processor. Refer to the man pages of the Intel compilers for more details about these options.
Intel compilers can also generate a single executable with multiple levels of vectorization with the '-ax' flag, which takes the same options as the -x flag (i.e., AVX, SSE2, ...). This flag will generate run-time checks to determine the level of vectorization support on the running processor and will then choose the optimal execution path for that processor.
Code does not always benefit from using the highest available vectorization instruction set such as AVX512, because the processor clocks down when running AVX512 instructions and frequency changes incur overhead. For example, the processors on the Gadi login nodes have a base clock speed of 2.9 GHz when running scalar code but only 1.9 GHz when running AVX512 instructions. Users should test the performance of their code on the target hardware with specific instruction sets such as AVX512 or AVX2 to see whether their code benefits from such instructions, or let the compiler make the decision instead of specifying a particular instruction set.
Intel compilers have a Guided Auto-Parallelization (GAP) option '-guide' (requires '-O2' or higher) to tell the compiler to generate advice on how to improve auto-vectorization, auto-parallelization, and data transformation, with '-guide-vec' providing guidance for auto-vectorization only. The advice may include suggestions for modifying source code, applying specific pragmas, or adding compiler options. For example, for the following code segment:
for (i=0;i<MAX;i++) {
    if(a[i]>0) b=a[i];
    if(a[i]>1) a[i]+=b;
}
When compiled with the '-O2 -guide-vec -parallel' options, the following GAP report will be printed to stdout:
GAP REPORT LOG OPENED ON Wed Jul 14 15:23:14 2021

example.c(10): remark #30515: (VECT) Assign a value to the variable(s) "b" at the beginning of the body of the loop in line 10. This will allow the loop to be vectorized. [VERIFY] Make sure that, in the original program, the variable(s) "b" read in any iteration of the loop has been defined earlier in the same iteration.

Number of advice-messages emitted for this compilation session: 1.
END OF GAP REPORT LOG
Following the advice to assign a value to 'b' at the beginning of the loop, the loop will then be vectorized:
for (i=0;i<MAX;i++) {
    b=0;
    if(a[i]>0) b=a[i];
    if(a[i]>1) a[i]+=b;
}
Compiler automatic vectorization often fails to vectorize the code, or cannot safely vectorize it without additional information from the programmer. In such cases, users can add directives, pragmas, etc. to the code to help the compiler generate more efficient vector code. Such directives and pragmas can be placed before a for loop or a function declaration.
Users can add compiler SIMD directives/pragmas to the source code to tell the compiler that dependency does not exist, so that the compiler can vectorize the loop when the user re-compiles the modified source code. Such SIMD directives/pragmas include:
vector: instruct the compiler to vectorize the loop according to the argument keywords. For the details of the clauses, please refer to this Intel documentation.
#pragma vector [clauses]
for-loop
ivdep: instruct the compiler to ignore potential data dependencies.
#pragma ivdep
for-loop
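For example, when a loop writes through one pointer and reads through another, the compiler must assume the two may alias and will report an assumed dependence. If the programmer knows the arrays never overlap, '#pragma ivdep' tells the compiler to ignore the assumed dependence. A minimal sketch (the function name is illustrative; the pragma is guarded so the code also compiles with non-Intel compilers, which would otherwise warn about an unknown pragma):

```c
/* The compiler must assume p and q may alias; #pragma ivdep tells it to
   ignore the assumed dependence. Correctness is then the programmer's
   responsibility: callers must guarantee p and q do not overlap. */
void scaled_copy(float *p, const float *q, float s, int n) {
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
    for (int i = 0; i < n; i++)
        p[i] = s * q[i];
}
```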
simd: enforce vectorization of a loop. For the details of the clauses, please refer to this Intel documentation.
#pragma simd [clauses]
for-loop
Users can use OpenMP directives/pragmas for explicit vectorization.
omp simd: enforce vectorization of a loop. For the details of the clauses, please refer to this OpenMP documentation.
#pragma omp simd [clauses]
for-loop
omp declare simd: instruct the compiler to vectorize a function. For the details of the clauses, please refer to this OpenMP documentation.
#pragma omp declare simd [clauses]
function definition or declaration
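'omp declare simd' asks the compiler to also emit a vector variant of a function, so that calls from a vectorized loop need not fall back to scalar code. A sketch, with hypothetical function names (compile with -qopenmp for icc or -fopenmp for gcc; without OpenMP the pragmas are ignored and the code still runs correctly as scalar code):

```c
/* Declare a SIMD-enabled (vector) variant of this element-wise function. */
#pragma omp declare simd
float saxpy_elem(float a, float x, float y) {
    return a * x + y;
}

/* The vectorized loop can then call the vector variant lane-by-lane. */
void saxpy(float a, const float *x, float *y, int n) {
#pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = saxpy_elem(a, x[i], y[i]);
}
```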
omp ordered simd: instruct the compiler to execute the structured block inside a SIMD region using scalar instructions. For details please refer to this OpenMP documentation.
#pragma omp ordered simd
structured-block
omp for simd: target same loop for threading and SIMD, with each thread executing SIMD instructions. For details please refer to this OpenMP documentation.
#pragma omp for simd [clauses]
for-loop
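A sketch of combined threading and SIMD, here using the combined 'omp parallel for simd' form so the example is self-contained (the function name is illustrative; compile with -qopenmp for icc or -fopenmp for gcc, otherwise the pragma is ignored and the loop runs as ordinary scalar code):

```c
/* The iteration space is split across OpenMP threads, and each thread's
   chunk of iterations is executed with SIMD instructions. */
void vec_add(const float *a, const float *b, float *c, int n) {
#pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```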
omp declare simd processor(mic_avx512)*: instruct the compiler to use 512-bit registers inside SIMD-enabled functions.
#pragma omp declare simd processor(mic_avx512)
function definition or declaration
* Only available for Intel compilers
Users can also use compiler options and macros for explicit vectorization:
-D NOALIAS / -noalias: assert that there is no aliasing of memory references (array addresses or pointers)
-D REDUCTION: apply an omp simd directive with a reduction clause
-D NOFUNCCALL: remove the function and inline the loop
-D ALIGNED / -align: assert that data is aligned on a 16-byte boundary
-fargument-noalias: function arguments cannot alias each other
Users can also declare and use SIMD-enabled functions. In the example below, the function foo is declared as a SIMD-enabled function (vector function) and is vectorized, as is the for loop in which it is called.
__attribute__((vector)) float foo(float);

void vfoo(float *restrict a, float *restrict b, int n) {
    int i;
    for (i = 0; i < n; i++) {
        a[i] = foo(b[i]);
    }
}

float foo(float x) { ... }
Intel MKL, available by running 'module load intel-mkl/<version>' on Gadi, provides vectorized functions for users to call to take advantage of vectorization supported by the processors.
Intel IPP, available by running 'module load intel-ipp/<version>' on Gadi, also provides vectorized functions.