Showing vectorization for different compilers

Note that this article contains results of the Fujitsu compiler, which is not available anymore on Ookami

Vectorization for different compilers

Compilers can do vectorization when setting the right flags. Here we are showing two examples of code compiled with the Cray, Arm and gnu compiler.

Simple math functions

The investigated functions are: Simple (Y = 2 X + 3 X²), Reciprocal, Square root, Exponential, Sin, Power function. Those are compiled using three different compilers, Cray, Arm and GNU (source code here). The compiler specific vectorization flags are turned on.

void Xsimple(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = 2.0*x[i] + 3.0*x[i]*x[i];
}

void Xrecip(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = 1.0/x[i];
}

void Xsqrt(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::sqrt(x[i]);
}

void Xexp(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::exp(x[i]);
}

void Xsin(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::sin(x[i]);
}

void Xpow(size_t n, const double* __restrict__ x, double* __restrict__ y) {
  for (size_t i=0; i<n; i++) y[i] = std::pow(x[i],0.55);
}

Below you can find the compiler versions and flags (for vectorization and vectorization reports) used for this example

Cray

module load CPE/22.03
Cray C++ : Version 10.0.3

with the compiler flags

-O3 -h aggress,flex_mp=tolerant,msgs,negmsgs,vector3,omp

Flag description:

O3
Optimization level 3

aggress
Provides greater opportunity to optimize loops that would otherwise by inhibited from optimization due to an internal compiler size limitation.

flex_mp=tolerant
Controls the aggressiveness of optimizations which may affect floating point and complex repeatability when application requirements require identical results whenvarying the number of ranks or threads. Tolerant uses most aggressive optimization and yields highest performance, but results may not be sufficiently repeatable for some applications

msgs
Causes the compiler to write optimization messages to stderr.

negmsgs
Causes the compiler to generate messages to stderr that indicate why optimizations such as vectorization or inlining did not occur in a given instance.

vector3
Specifies the level of automatic vectorizing to be performed. Vectorization results in dramatic performance improvements with a small increase in object code size. Vectorization directives are unaffected by this option. 3 specifies aggressive vectorization.

omp
OMP supportArm

module load arm-modules/22.0
Arm C/C++/Fortran Compiler version 22.0.1

with the compiler flags

-Ofast -ffp-contract=fast -Wall -Rpass=loop-vectorize -march=armv8.2-a+sve -mcpu=a64fx -armpl -fopenmp

Flag description:

Ofast
Enables all the optimizations from level 3 including those performed with the -ffp-mode=fast armclang option. This level also performs other aggressive optimizations that might violate strict compliance with language standards. -Ofast implies -ffast-math.

ffp-contract=fast
If you set -ffp-contract=fast fused floating-point contractions are always used and the compiler ignores the 'STDC FP_CONTRACT' pragma setting.

Wall
Enable all warnings.

Rpass=loop-vectorize
Enable vectorization report

march=armv8.2-a+sve
Specifies architecture and extensions.

mcpu=a64fx
Select CPU architecture.

armpl
Use the 'Generic' SVE library from Arm Performance Libraries.

fopenmp
Enable OpenMPGNU

module load gcc/12.1.0
gcc (GCC) 12.1.0

with the compiler flags

-Ofast -Wall -mtune=a64fx -mcpu=a64fx -march=armv8.2-a+sve -fopt-info-vec -fopenmp

Flag description:

Ofast, Wall, mcpu=a64fx, march=armv8.2-a+sve
see descriptions above

mtune=a64fx
Tune to cpu-type

fopt-info-vec
Output vectorization report.

Fujitsu

module load fujitsu/compiler/4.7
FCC (FCC) 4.7.0 20211110

with the compiler flags

 -Kfast -KSVE -Koptmsg=2

Flag description:

Kfast
Optimization

KSVE
Vectorization

Koptmsg=2
Output vectorization report.

When compiling the compiler output suggests that Fujitsu, Cray and Arm vectorize all functions, whereas GNU can't vectorize exp, sin and pow.

	Fujitsu	Cray	Arm	GNU
Simple (Y = 2 X + 3 X²)
Reciprocal
Square root
Exponential
Sin
Power

However, looking at the runtimes of the functions gives a more complex picture (see Figure 1 & 2). The Fujitsu and cray compilers vectorizes everything as claimed. The arm compiler claims to vectorize all functions. It is doing this but some functions are not vectorized in the most efficient way. The code assembly shows that it uses the DIV and SQRT functions rather than the more efficient Netwon algorithm. And the gnu compiler just vectorizes the simple function. The recip and sqrt are not vectorized as expected from the compiler output.

Figure 1 & 2: Runtimes of the simple math functions for different compilers.

The Fujitsu compiler gives the best results.

Frequently Asked Questions