Sum and dot
All the computations that we have discussed so far are, in the jargon of SIMD programming, vertical operations. For each element of the input arrays, they produce zero or one matching output array element. From the perspective of software performance, these vertical operations are the best-case scenario, and compilers know how to produce very efficient code out of them without much assistance from the programmer. Which is why we have not discussed performance much so far.
But in this chapter, we will now switch our focus to horizontal operations, also known as reductions. We will see why these operations are much more challenging to compiler optimizers, and then the next few chapters will cover what programmers can do to make them more efficient.
Summing numbers
One of the simplest reduction operations that one can do with an array of floating-point numbers is to compute the sum of the numbers. Because this is a common operation, Rust iterators make it very easy by providing a dedicated method for it:
#![allow(unused)] fn main() { let sum = [1.2, 3.4, 5.6, 7.8].into_iter().sum::<f32>(); }
The only surprising thing here might be the need to spell out the type of the sum. This need does not come up because the compiler does not know about the type of the array elements that we are summing. That could be handled by the default “unknown floats are f64” fallback.
The problem is instead that the sum()
method is generic in order to be able to
work with both numbers and reference to numbers (which we have not covered yet,
for now you can think of them as being like C pointers). And wherever there is
genericity in Rust, there is loss of type inference.
Dot product
Once we have zip()
, map()
and sum()
, it takes only very little work to
combine them in order to implement a simple Euclidean dot product:
#![allow(unused)] fn main() { let x = [1.2f32, 3.4, 5.6, 7.8]; let y = [9.8, 7.6, 5.4, 3.2]; let dot = x.into_iter().zip(y) .map(|(x, y)| x * y) .sum::<f32>(); }
Hardware-minded people, however, may know that we are leaving some performance and floating-point precision on the table by doing it like this. Modern CPUs come with fused multiply-add operations that are as costly as an addition or multiplication, and do not round between the multiplication and addition which results in better output precision.
Rust exposes these hardware operations1, and we can use them by switching to
a more general cousin of sum()
called fold()
:
#![allow(unused)] fn main() { let x = [1.2f32, 3.4, 5.6, 7.8]; let y = [9.8, 7.6, 5.4, 3.2]; let dot = x.into_iter().zip(y) .fold(0.0, |acc, (x, y)| x.mul_add(y, acc)); }
Fold works by initializing an accumulator variable with a user-provided value, and then going through each element of the input iterator and integrating it into the accumulator, each time producing an updated accumulator. In other words, the above code is equivalent to this:
#![allow(unused)] fn main() { let x = [1.2f32, 3.4, 5.6, 7.8]; let y = [9.8, 7.6, 5.4, 3.2]; let mut acc = 0.0; for (x, y) in x.into_iter().zip(y) { acc = x.mul_add(y, acc); } let dot = acc; }
And indeed, the iterator fold method optimizes just as well as the above imperative code, resulting in identical generate machine code. The problem is that unfortunately, that imperative code itself is not ideal and will result in very poor computational performance. We will now show how to quantify this problem, and later explain why it happens and what you can do about it.
Setting up criterion
The need for criterion
With modern hardware, compiler and operating systems, measuring the performance of short-running code snippets has become a fine art that requires a fair amount of care.
Simply surrounding code with OS timer calls and subtracting the readings may have worked well many decades ago. But nowadays it is often necessary to use specialized tooling that leverages repeated measurements and statistical analysis, in order to get stable performance numbers that truly reflect the code’s performance and are reproducible from one execution to another.
The Rust compiler has built-in tools for this, but unfortunately they are not
fully exposed by stable versions at the time of writing, as there is a
longstanding desire to clean up and rework some of the associated APIs before
exposing them for broader use. As a result, third-party libraries should be used
for now. In this course, we will be mainly using
criterion
as it is
by far the most mature and popular option available to Rust developers
today.2
Adding criterion
to a project
Because criterion
is based on the Rust compiler’s benchmarking infrastructure,
but cannot fully use it as it is not completely stabilized yet, it does
unfortunately require a fair bit of unclean setup. First you must add a
dependency on criterion
in your Rust project.
Because this course has been designed to work on HPC centers without Internet access on worker nodes, we have already done this for you in the example source code. But for your information, it is done using the following command:
cargo add --dev criterion
cargo add
is an easy-to-use tool for editing your Rust project’s Cargo.toml
configuration file in order to register a dependency on (by default) the latest
version of some library. With the --dev
option, we specify that this
dependency will only be used during development, and should not be included in
production builds, which is the right thing to do for a benchmarking harness.
After this, every time we add a benchmark to the application, we will need to
manually edit the Cargo.toml
configuration file in order to add an entry that
disables the Rust compiler’s built-in benchmark harness. This is done so that it
does not interfere with criterion
’s work by erroring out on criterion-specific
CLI benchmark options that it does not expect. The associated Cargo.toml
configuration file entry looks like this:
[[bench]]
name = "my_benchmark"
harness = false
Unfortunately, this is not yet enough, because benchmarks can be declared pretty
much anywhere in Rust. So we must additionally disable the compiler’s built-in
benchmark harness on every other binary defined by the project. For a simple
library project that defines no extra binaries, this extra Cargo.toml
configuration entry should do it:
[lib]
bench = false
It is only after we have done all of this setup, that we can get criterion
benchmarks that will reliably accept our CLI arguments, no matter how they were
started.
A benchmark skeleton
Now that our Rust project is set up properly for benchmarking, we can start
writing a benchmark. First you need to create a benches
directory in
your project if it does not already exists, and create a source file there,
named however you like with a .rs
extension.
Then you must add the criterion
boilerplate to this source file, which is
partially automated using macros, in order to get a runnable benchmark that
integrates with the standard cargo bench
tool…
use criterion::{black_box, criterion_group, criterion_main, Criterion};
pub fn criterion_benchmark(c: &mut Criterion) {
// This is just an example benchmark that you can freely delete
c.bench_function("sqrt 4.2", |b| b.iter(|| black_box(4.2).sqrt()));
}
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
…and finally you must add the aforementioned Cargo.toml
boilerplate so that
criterion CLI arguments keep working as expected. Assuming you unimaginatively
named your benchmark source file “benchmark.rs”, this would be…
[[bench]]
name = "benchmark"
harness = false
Writing a good microbenchmark
There are a few basic rules that you should always follow whenever you are writing a microbenchmark that measures the performance of a small function in your code, if you do not want the compiler’s optimizer to transform your benchmark in unrealistic ways:
- Any input value that is known to the compiler’s optimizer can be used to tune
the code specifically for this input value, and sometimes even to reduce the
benchmarking loop to a single iteration by leveraging the fact that the
computation always operate on the same input. Therefore, you must always hide
inputs from the compiler’s optimizer using an optimization barrier such as
criterion
’sblack_box()
. - Any output value that is not used in any way is a useless computation in the
eyes of the compiler’s optimizer. Therefore, the compiler’s optimizer will
attempt to delete any code that is involved in the computation of such values.
To avoid this, you will want again to feed results into an optimization
barrier like
criterion
’sblack_box()
.criterion
implicitly does this for any output value that you return from theiter()
API’s callback.
It’s not just about the compiler’s optimizer though. Hardware and operating systems can also leverage the regular nature of microbenchmarks to optimize performance in unrealistic ways, for example CPUs will exhibit an unusually good cache hit rate when running benchmarks that always operate on the same input values. This is not something that you can guard against, just a pitfall that you need to keep in mind when interpreting benchmark results: absolute timings are usually an overly optimistic estimate of your application’s performance, and therefore the most interesting output of microbenchmarks is actually not the raw result but the relative variations of this result when you change the code that is being benchmarked.
Finally, on a more fundamental level, you must understand that on modern hardware, performance usually depends on problem size in a highly nonlinear and non-obvious manner. It is therefore a good idea to test the performance of your functions over a wide range of problem sizes. Geometric sequences of problem sizes like 1, 2, 4, 8, 16, 32, … are often a good default choice.
Exercise
Due to Rust benchmark harness design choices, the exercise for this chapter
will, for once, not take place in the examples
subdirectory of the exercises’
source tree.
Instead, you will mainly work on the benches/06-summit.rs
source file, which
is a Criterion benchmark that was created using the procedure described above.
Implement the sum()
function within this benchmark to make it sum the elements
of its input Vec
, then run the benchmark with cargo bench --bench 06-summit
.
If you are using the Devana cluster, you will want to run the benchmarks on a
worker node by prefixing this command with srun
.
To correctly interpret the results, you should know that a single core of a modern x86 CPU, with a 2 GHz clock rate and AVX SIMD instructions, can perform 32 billion f32 sums per second.3
Technically, at this stage of our performance optimization journey, we can
only use these operations via a costly libm function call. This happens
because not all x86_64 CPUs support the fma
instruction family, and the
compiler has to be conservative in order to produce code that runs
everywhere. Later, we will see how to leverage modern hardware better
using the multiversion
crate.
Although criterion would be my recommendation today due to its
feature-completeness and popularity, divan
is
quickly shaping up into an interesting alternative that might make it the
recommendation in a future edition of this course. Its benefits over
criterion
include significantly improved API ergonomics, faster
measurements, and better support for code that is generic over
compile-time quantities aka “const generics”.
Recent Intel CPUs (as of 2024) introduced the ability to perform a third SIMD sum per clock cycle, which bumps the theoretical limit to 48 billion f32 sums per second per 2 GHz CPU core, or even double that on those few CPUs that support AVX-512 with native-width ALUs.