What is Vectorization

Vectorization, a form of data-level parallelism, is the process of converting code from operating on a single value at a time to operating on a set of values (a vector) at a time, in order to speed up execution. Modern CPUs provide direct support for vector operations, where a single instruction is applied to multiple data (SIMD). The processors on Gadi provide the following vector registers and instruction sets:

  • The Cascade Lake processors in the Gadi nodes in the 'normal', 'express', 'copyq', 'gpuvolta', 'hugemem', and 'megamem' queues provide 512-bit vector registers, and the SSE (Streaming SIMD Extensions), SSE2, SSSE3, SSE4_1, SSE4_2, AVX (Advanced Vector Extensions), AVX2, AVX512, and AVX512 VNNI (Vector Neural Network Instructions, for deep-learning acceleration) instruction sets;
  • The Skylake processors in the Gadi nodes in the 'biodev' and 'normalsl' queues provide 512-bit vector registers, and SSE, SSE2, SSSE3, SSE4_1, SSE4_2, AVX, AVX2, AVX512 instruction sets;
  • The Broadwell processors in the Gadi nodes in the 'normalbw', 'expressbw', 'hugemembw', and 'megamembw' queues provide 256-bit vector registers, and SSE, SSE2, SSSE3, SSE4_1, SSE4_2, AVX, AVX2 instruction sets.

To get the most performance out of these processors, users should take advantage of these vector registers and instructions, and try to improve the usage of vectorization in their code. This document provides a guideline on how to get vectorization information and improve code vectorization.

Vectorizable Loops

Only loops that meet the following conditions can be safely vectorized:

No Data Dependency

A loop that contains, for example, A[i] += A[i-1] (called read-after-write) cannot be vectorized, as each iteration reads a value that was changed in the previous iteration. When the code is compiled with '-qopt-report -qopt-report-phase=vec', the vectorization report contains the following:

LOOP BEGIN at example.c(13,4)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between a[i] (14:8) and a[i-1] (14:8)
LOOP END

However, a loop with A[i] += A[i+1] (called write-after-read) can be safely vectorized, as each iteration reads ahead into the data of the next iteration, which has not yet been changed.

Countable

The loop count must be known at entry to the loop at run time, though it does not need to be known at compile time; that is, it can be a variable, but it must remain constant for the duration of the loop.

Single Entry and Single Exit

Loops with data-dependent exit cannot be vectorized.

Straight-line Code

Loops with iterations having different flow control cannot be vectorized, i.e., they must not branch.

No Function Calls

Loops with function calls usually cannot be vectorized. Exceptions are intrinsic math functions and functions that can be inlined.

Contiguous Memory Access

If the data accessed in consecutive iterations of a loop are not adjacent, such as in loops with indirect addressing or non-unit stride, they must be loaded separately using multiple instructions. Compilers rarely vectorize such loops unless the amount of computation is large enough to outweigh the overhead of non-contiguous memory access.

Get Vectorization Information

There are various ways to get information regarding how a code is vectorized. The following information is for Intel compilers. For GCC compilers please refer to the corresponding man page or documentation.

Compiler options '-qopt-report=5' and '-qopt-report -qopt-report-phase=vec'

Compiler option '-qopt-report=5' can be used to generate an optimization report, which contains vectorization information. To generate a report for vectorization only, use '-qopt-report -qopt-report-phase=vec'.

For example, for the following sample C code fragment in a file named example.c:

for (i=0; i<MAX; i++)
    c[i]=a[i]+b[i];

When the file is compiled with 'icc -qopt-report -qopt-report-phase=vec', a report file named example.optrpt will contain the following vectorization report:

LOOP BEGIN at example.c(10,4)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

When compiled with '-qopt-report=5', more detailed report will be generated in the example.optrpt report file:

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at example.c(10,4)
   remark #15388: vectorization support: reference c[i] has aligned access   [ example.c(11,8) ]
   remark #15388: vectorization support: reference a[i] has aligned access   [ example.c(11,13) ]
   remark #15388: vectorization support: reference b[i] has aligned access   [ example.c(11,18) ]
   remark #15305: vectorization support: vector length 4
   remark #15399: vectorization support: unroll factor set to 2
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 6
   remark #15477: vector cost: 1.250
   remark #15478: estimated potential speedup: 2.350
   remark #15488: --- end vector cost summary ---
   remark #25015: Estimate of max trip count of loop=1
LOOP END

Vectorization Advisor of Intel Advisor

Intel Advisor has a feature called Vectorization Advisor to help analyse existing vectorization, detect 'hot' un-vectorized or under-vectorized loops, and provide advice on how to improve the use of vectorization.

For details of how to use Intel Advisor and its Vectorization Advisor, refer to Intel® Advisor Tutorial for Adding Efficient SIMD Parallelism to C++ Code Using the Vectorization Advisor for Linux.

Intel Advisor provides both command line and GUI tools, called advixe-cl and advixe-gui respectively. To use Intel Advisor to get vectorization information, users can compile and run their code in the following steps:

1) Compile and link Fortran/C/C++ program using corresponding Intel compiler with the vectorization option:

Example:

$ module load intel-compiler/<version>
$ icc -g -qopenmp -O2 ./example.c -o ./example

Some Intel compiler options are listed below:

Compiler option    Function
-g                 Build the application with debug information to allow binary-to-source correlation in the reports
-qopenmp           Enable generation of multi-threaded code if OpenMP directives/pragmas exist
-O2 (or higher)    Request compiler optimization
-vec               Enable vectorization if -O2 or higher is in effect (enabled by default)
-simd              Enable SIMD directives/pragmas (enabled by default)

For details of these options, refer to the man pages or documentation of the Intel compilers.

2) Submit a PBS job which executes the binary and runs the Intel Advisor command line tool advixe-cl to collect vectorization information:

Example (PBS scripting part is omitted):

...
$ module load intel-advisor/<version>
$ advixe-cl --collect survey --project-dir ./advi ./example

For a 4-process MPI program, collect survey data into the shared ./advi project directory:

...
$ module load intel-advisor/<version>
$ mpirun -n 4 advixe-cl --collect survey --project-dir ./advi ./mpi_example

3) Once the job finishes, launch advixe-gui on a login node to visualize the data collected by advixe-cl:

 Example:

$ module load intel-advisor/<version>
$ advixe-gui &
  • Click on the "Open Project/Result" tab and then select the .advixeexp file generated by advixe-cl in the previous step.
  • The "Summary" section shows the summary of the report generated by the Vectorization Advisor, including the vector instruction sets used and the vectorization gain/efficiency.
  • The Survey Report provides detailed compiler report data and performance data regarding vectorization such as:
    • Which loops are vectorized, the location in the source code
    • Vectorization issues
    • The reason why a loop is not vectorized
    • Vector ISA used
    • Vectorization efficiency, speedup
    • Vector length (# of elements processed in the SIMD instruction)
    • Vectorization instructions used


If a loop cannot be vectorized automatically, Intel Advisor will report the reason and give advice on how to fix the vectorization issues specific to your code, such as dependency analysis and memory access pattern analysis. Users should follow this advice, modify their source code, and give the compiler more hints to improve vectorization, by using compiler options or adding directives/pragmas to the source code (explicit vectorization).

Compiler option '-S'

The compiler option '-S' can be used to generate assembly code instead of binary. In the assembly code, SSE vector instructions generally operate on xmm registers, AVX and AVX2 on ymm registers, and AVX512 on zmm registers. AVX, AVX2, and AVX512 instructions are prefixed with 'v'.

Example of compiling a C program to generate assembly code instead of binary code:

$ icc -S -xHost -qopt-zmm-usage=high -o example.s example.c

(The '-qopt-zmm-usage=high' option above tells the compiler to generate zmm (AVX512) code without restrictions. It is used here for illustration only; in practice it may reduce the performance of your code.)

Example of generated vectorized assembly code:

vmovdqu32 %zmm1, (%rsi,%rdi,4)                          #11.8
vmovdqu32 %zmm0, 64(%rsi,%rdi,4)                        #11.8
vpaddd    %zmm2, %zmm1, %zmm1                           #11.15
vpaddd    %zmm2, %zmm0, %zmm0                           #11.15                               

Automatic Vectorization

Vectorization can be implemented either automatically by the compiler without any extra code, or explicitly by adding directives, pragmas, macros, etc. Automatic vectorization is enabled when a program is compiled with -O2 or higher. Users can also specify the '-xCORE-AVX512' or '-xCOMMON-AVX512' flags, which tell the compiler to generate AVX512 instructions, optionally together with '-qopt-zmm-usage=high' to favour 512-bit zmm registers. Another option is the '-xHost' flag, which tells the compiler to generate the highest level of vectorization supported on the compiling processor. Refer to the man pages of the Intel compilers for more details about these options.

Intel compilers can also generate a single executable with multiple levels of vectorization with the '-ax' flag, which takes the same options as the -x flag (i.e., AVX, SSE2, ...). This flag will generate run-time checks to determine the level of vectorization support on the running processor and will then choose the optimal execution path for that processor.

Code does not always benefit from using the highest available vectorization instruction set such as AVX512: the processors clock down when running AVX512 instructions, and frequency changes incur overhead. For example, the processors on the Gadi login nodes run at a base clock speed of 2.9 GHz for scalar code but only 1.9 GHz when running AVX512 instructions. Users should test the performance of their code with specific instruction sets such as AVX512 or AVX2 on the target hardware to see whether their code benefits from such instructions, or let the compiler make the decision instead of specifying a particular instruction set.

Guided Auto-Parallelization

Intel compilers have a Guided Auto-Parallelization (GAP) option '-guide' (requires '-O2' or higher) that tells the compiler to generate advice on how to improve auto-vectorization, auto-parallelization, and data transformation, with '-guide-vec' providing guidance for auto-vectorization only. The advice may include suggestions for modifying source code, applying specific pragmas, or adding compiler options. For example, for the following code segment:

   for (i=0;i<MAX;i++) {
       if(a[i]>0) b=a[i];
       if(a[i]>1) a[i]+=b;
   }

When compiled with the '-O2 -guide-vec -parallel' options, the following GAP report will be printed to stdout:

GAP REPORT LOG OPENED ON Wed Jul 14 15:23:14 2021

example.c(10): remark #30515: (VECT) Assign a value to the variable(s) "b" at the beginning of the body of the loop in line 10. This will allow the loop to be vectorized. [VERIFY] Make sure that, in the original program, the variable(s) "b" read in any iteration of the loop has been defined earlier in the same iteration.

Number of advice-messages emitted for this compilation session: 1.
END OF GAP REPORT LOG

Following the advice to assign a value to 'b' at the beginning of the loop, the loop will then be vectorized:

   for (i=0;i<MAX;i++) {
       b=0;
       if(a[i]>0) b=a[i];
       if(a[i]>1) a[i]+=b;
   }

Explicit Vectorization

Compiler automatic vectorization often fails to vectorize code, or cannot safely vectorize it without additional information from the programmer. In such cases, users can add directives, pragmas, etc., to the code to help the compiler generate more efficient vector code. Such directives and pragmas can be placed before a for loop or a function declaration.

Compiler SIMD directives/pragmas

Users can add compiler SIMD directives/pragmas to the source code to tell the compiler that dependencies do not exist, so that the compiler can vectorize the loop when the modified source code is re-compiled. Such SIMD directives/pragmas include:

  • vector: instruct the compiler to vectorize the loop according to the argument keywords. For the details of the clauses, please refer to this Intel documentation.

    #pragma vector [clauses]
         for-loop
  • ivdep: instruct the compiler to ignore potential data dependencies.

    #pragma ivdep
         for-loop
  • simd: enforce vectorization of a loop. For the details of the clauses, please refer to this Intel documentation.

    #pragma simd [clauses]
         for-loop

OpenMP directives/pragmas

Users can use OpenMP directives/pragmas for explicit vectorization.

  • omp simd: enforce vectorization of a loop. For the details of the clauses, please refer to this OpenMP documentation.

    #pragma omp simd [clauses]
         for-loop
  • omp declare simd: instruct the compiler to vectorize a function. For the details of the clauses, please refer to this OpenMP documentation.

    #pragma omp declare simd [clauses]
         function definition or declaration
  • omp ordered simd: instruct the compiler to execute the structured block inside a SIMD region using scalar instructions. For details please refer to this OpenMP documentation.

    #pragma omp ordered simd 
         structured-block
  • omp for simd: target same loop for threading and SIMD, with each thread executing SIMD instructions. For details please refer to this OpenMP documentation.

    #pragma omp for simd [clauses]
        for-loop
  • omp declare simd processor(mic_avx512)*: instruct the compiler to use 512-bit registers inside SIMD-enabled functions.

    #pragma omp declare simd processor(mic_avx512)
          function definition or declaration

      * Only available for Intel compilers

Compiler options and macros

Users can also use compiler options and macros for explicit vectorization:

  • -D NOALIAS / -noalias: assert that there is no aliasing of memory references (array addresses or pointers)
  • -D REDUCTION: apply an omp simd directive with a reduction clause
  • -D NOFUNCCALL: remove the function and inline the loop
  • -D ALIGNED / -align: assert that data is aligned on a 16-byte boundary
  • -fargument-noalias: assert that function arguments cannot alias each other

SIMD enabled functions

Users can also declare and use SIMD-enabled functions. In the example below, the function foo is declared as a SIMD-enabled function (vector function) and is vectorized, as is the for loop in which it is called.

__attribute__((vector))   /* Intel compiler syntax for declaring a SIMD-enabled (vector) function */
float foo(float); 
void vfoo(float *restrict a, float *restrict b, int n){ 
    int i; 
    for (i=0; i<n; i++) { a[i] = foo(b[i]); } 
} 
float foo(float x) { ... }

Intel Math Kernel Library (MKL) vectorized functions

Intel MKL, available by running 'module load intel-mkl/<version>' on Gadi, provides vectorized functions that users can call to take advantage of the vectorization support of the processors.

Intel Integrated Performance Primitives (IPP) vectorized functions

Intel IPP, available by running 'module load intel-ipp/<version>' on Gadi, also provides vectorized functions.

Programming Guidelines for Writing Vectorizable Code

  • Use simple loops; avoid a varying upper iteration limit and data-dependent loop exit conditions;
  • Write straight-line code, avoid branches, most function calls or if constructs;
  • Use array notations instead of pointers;
  • Use unit stride (increment 1 for each iteration) in inner loops;
  • Use aligned data layout (memory addresses). For using SSE instructions, align the data to 16-byte boundaries; for using AVX and AVX2 instructions, align the data to 32-byte boundaries; and for using AVX512 instructions, align the data to 64-byte boundaries;
  • Use structure of arrays instead of arrays of structures;
  • Use only assignment statements in the innermost loops;
  • Avoid data dependencies between loop iterations, such as read-after-write, write-after-read, write-after-write;
  • Avoid indirect addressing;
  • Avoid mixing vectorizable data types in the same loop;
  • Avoid function calls in the innermost loop, except math library calls.