Environment setup
To do the practical work from this course, you will need access to a suitably configured Rust development environment. We support three different ways to get such an environment, but depending on your particular situation, they may not all be available/suitable for you:
- If you are using a Linux desktop environment1 and a CPU that is based on the common x86_64 CPU architecture (any modern CPU from Intel or AMD), then another way to get a pre-built development environment is to use Linux containers.
- If none of the above applies (e.g. you use a modern Mac with an Arm CPU), or if you fancy it for any other reason, then you can also manually install the required packages on your computer.
There are non-native ways to run Linux containers from a Windows or macOS desktop: Docker Desktop, WSL2, etc. They all work by running the container inside of a hidden x86 Linux VM. We don’t recommend them because, in our experience, you may run into issues connecting to the code editor of the rust_code_server image, and into surprising runtime performance characteristics on Arm-based Macs due to Rosetta emulation.
Linux containers
We provide two different containerized Rust development environment options, geared towards two different kinds of container users.
- For maximal ease of use, the rust_code_server environment provides a pre-configured installation of the popular VS Code editor that you can use from your web browser. This is the recommended choice if you like this code editor, are not very familiar with containers, or just want to get something working with minimal effort.
- If you are an experienced container user and want more control, the rust_light environment only provides the minimum CLI tooling and libraries needed to build and run the course’s code. You can then set up any code editing environment you like on top of it.
Both environments have been tested with the Docker and Podman container runtimes. They are not compatible with Apptainer/Singularity because the way this container runtime manages user accounts and home directories is fundamentally at odds with our needs.
Using rust_code_server
The rust_code_server environment provides a pre-configured development environment based on the VS Code editor, accessible from your web browser. It is the recommended route if you want to get a working environment with minimal effort on a native Linux system.
Setting up Docker or Podman
You should first install one of Docker Engine or Podman, if you have not done so already. The name of each container runtime in the previous sentence links to the recommended setup instructions.
If you are unfamiliar with containers and just want the easiest default choice, Docker is easiest to install and would therefore be my recommendation for beginners. Just make sure that you do go for a native installation of Docker Engine, and not a Docker Desktop installation. The latter is not suitable for using rust_code_server because it uses a hidden virtual machine under the hood, which will prevent you from using the integrated code editor later in this tutorial.1
After installation, you may need to do a little bit of extra system configuration to make the container runtime usable by non-root users (as explained in the installation instructions linked above). Then you should be able to run a test container that prints a hello world message.
Please pick your container runtime below to see the appropriate command for this test:
docker run hello-world
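For Podman, the equivalent test command should be:
podman run hello-world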
Without getting into details, we had to deal with a security/usability tradeoff here. Either we allowed everyone on your local network to access your code editor, or we broke Docker Desktop’s hidden VM. We chose the secure option that breaks Docker Desktop but keeps your code editor private.
Downloading the exercises source code
You may have been directed to this documentation some time ahead of the course, at a point where the material may not be fully finalized yet. To handle material evolution, and to allow you to save your work, we distribute the exercises’ source code via a separate archive that you should unpack somewhere on your machine. The resulting directory will then be mounted inside of the container.
Provided that the wget and unzip utilities are installed, you can download and unpack the source code in the current directory using the following sequence of commands:
if [ -e exercises ]; then
echo "ERROR: Please move or delete the existing 'exercises' subdirectory"
else
wget https://numerical-rust-cpu-d1379d.pages.math.cnrs.fr/setup/exercises.zip \
&& unzip exercises.zip \
&& rm exercises.zip
fi
The following steps will assume that you are in the directory where the archive has been extracted, as if you just ran the sequence of commands above (an exercises/ subdirectory should be present in the output of ls).
Starting the container
Now that you have a working container runtime and the exercises’ source code, you should be able to run a container based on the development environment. We will need to change a few things with respect to the default configuration of docker run and podman run:
- For interactive use, we will want to follow the container’s standard output and be able to interact with it using our terminal. This can be done using the -it pair of options.
- There is no point in keeping interactive containers around after we have stopped them, so it is a good idea to automatically delete the container after use via the --rm option.
- Our code editing environment uses an HTTP server on port 8080, which we must expose to be able to connect to it. The easiest way to do this is to use the --net=host option.
- And finally, we need to mount the exercises material into the container so that it can be used inside. This can be done using the -v "$(pwd)/exercises":/home/jovyan/exercises option.
  - If you use Podman, then you should add an :U suffix at the end of this option so that non-privileged users inside of the container get write access to the code.
Putting this all together, we get the following Docker and Podman command lines:
docker run --net=host --rm -it \
-v "$(pwd)/exercises":/home/jovyan/exercises \
registry.plmlab.math.cnrs.fr/grasland/numerical-rust-cpu/rust_code_server:latest
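For Podman, the command should be the same apart from the :U suffix discussed above:
podman run --net=host --rm -it \
-v "$(pwd)/exercises":/home/jovyan/exercises:U \
registry.plmlab.math.cnrs.fr/grasland/numerical-rust-cpu/rust_code_server:latest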
If you get an error message about port 8080 already being in use, it likely means that some other software on your machine (e.g. Jupyter) is already listening for network connections on port 8080. In that case, you must hunt down and close the offending software, using tools like ss -tnlp to find it.
Once you solve these TCP port allocation issues, you will get a few lines of output notifying you that code-server is ready to serve HTTP requests. Before that, there will be a message along these lines…
### Use the following password: xxxxxxxxxxxxxxxxxxxxxxxx
…where xxxxxxxxxxxxxxxxxxxxxxxx is a string of hexadecimal digits. Copy this string into your clipboard, then point your web browser to the http://127.0.0.1:8080 URL and paste the password when asked. Once that is done, you should land in a browser-based version of VS Code.
Sadly, as you will quickly figure out, this code editor is not fully functional yet, because it cannot write into the ~/exercises directory that we have mounted inside of the container.
To fix this, you will need to check back in the terminal where you ran the docker run command, and look for the container’s output right after the “Use the following password” line:
### If using Docker, mounted files must be chown'd to uuuu:gggg
…where uuuu and gggg are the UID and the GID of the user running the code editor inside of the container. This is the information that we are going to need next.
Fixing up file permissions
One drawback of using Linux containers is that the users inside of the containers do not match those of your local system, which causes all sorts of access permission problems when sharing files between the container and the host. Docker and Podman handle this issue a little differently.
With Docker, you are responsible for changing the permissions of the directories that you mount inside of the container, so that the user inside of the container can access it.
This can be done by running the command sudo chown -R uuuu:gggg exercises, where uuuu and gggg are the user and group ID that the container prints on startup.
You must do this in order to let the provided code editor edit files in the exercises/ directory.
At the end of the course, with both Docker and Podman, you will need to restore the original file permissions in order to be able to manipulate the exercises/ directory normally again. This can be done with the following command:
sudo chown -R $(id -u):$(id -g) exercises/
Suggested workflow
The suggested workflow during the course is for you to have this course’s handouts opened in one browser tab, and the code editor opened in another tab. If your screen is large enough, two browser windows side by side will provide extra comfort.
After performing basic Visual Studio Code configuration, I advise opening a terminal, which you can do by showing the bottom pane of the code editor using the Ctrl+J keyboard shortcut.
You would then direct this terminal to the exercises directory if you’re not already there…
cd ~/exercises
And, when you’re ready to do the exercises, start running cargo run with the options specified by the corresponding course material.
Using rust_light
The rust_light image provides the minimal amount of CLI tooling and libraries needed to do the practicals. It is based on Ubuntu 22.04, with libraries installed system-wide and the Rust toolchain and associated software installed locally to the root user’s home directory (/root).
The intent is for you to layer whatever code editing environment you prefer1 on top of this using your favorite container image building mechanisms: Dockerfiles, docker commit, buildah…
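For example, a minimal Dockerfile along these lines might look like the following sketch. The rust_light image path is an assumption based on the rust_code_server path shown earlier, and vim is just a placeholder editor:

FROM registry.plmlab.math.cnrs.fr/grasland/numerical-rust-cpu/rust_light:latest
# Hypothetical example: add a terminal-based editor on top of the Ubuntu 22.04 base
RUN apt-get update \
    && apt-get install -y --no-install-recommends vim \
    && rm -rf /var/lib/apt/lists/*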
Alternatively, you can just directly bind-mount the source code as done with the rust_code_server image (minus the :U bind mount flag for Podman, see bullets below) and start editing it with your local code editor on the host system, only using the container for builds and executions. Compared to the suggested rust_code_server workflow…
- You cannot get early feedback from rust-analyzer while editing code, because it must be running inside of the container to have access to all dependencies. In general, if your code editor has check/build/run shortcuts or other IDE functionality, it won’t work as expected.
- You will not experience any file permission issues inside of the container, and can/should drop the :U bind mount flag when using Podman, because everything in the rust_light container runs as the root user, and root can write to any file no matter which UID/GID is the owner.
- If you are using Docker, you will still get file permission issues outside of the container, because on Linux, any file created by the root user inside of a Docker container is owned by the host root account. Podman does not suffer from this issue.2
Given that this environment is open-ended and geared towards expert container users who want to go in exotic directions, a detailed tutorial cannot easily be written. If you run into any trouble while going this route, just e-mail the trainer and we’ll try to figure it out together.
Be warned that getting X11/Wayland software to work inside of Linux containers involves a fair amount of suffering. You will make your life easier by favoring editors that run directly in the terminal (like vi and nano) or expose a web-based GUI via an HTTP server (like JupyterLab and code-server).
Speaking from the perspective of using Docker and Podman in the conventional way, which for Docker involves a daemon with root privileges and for Podman involves a rootless setup with sub-UIDs/GIDs. If you are using more exotic setups, like rootless Docker or sudo’d Podman commands, your experience may differ.
Local installation
Goals
To do the practicals on your local system, you are going to need the following things:
- A basic software build toolchain including a libc development package and a linker.
- A Rust installation based on the official rustup installer.1
- The HDF5 and hwloc libraries.
- An implementation of the pkg-config command. This does not have to be the original implementation: pkgconf on Unices and pkgconfiglite on Windows work fine too.
Other tools which are not absolutely necessary but will prove convenient include…
- A code editor with at least Rust syntax highlighting, and preferably TOML syntax highlighting and support for the rust-analyzer language server too.
  - See the index of the “Installation” section of the rust-analyzer manual for a fairly comprehensive list of supported editors and the language server setup steps for each.
- A POSIX shell and the lscpu, unzip and wget command-line utilities. These will allow you to run the few non-Cargo commands featured in this course to automate some tasks.
The remainder of this chapter will guide you towards installing and setting up these dependencies, except for the code editor which we will treat as a personal choice and leave you in charge of.
Third-party cargo and rustc packages from Linux distributions, Homebrew and friends cannot be used during this course because we are going to need a nightly version of the compiler in order to demonstrate an important upcoming language feature pertaining to SIMD computations.
OS-specific steps
Please pick your operating system for setup instructions:
Linux
System packages
There are many different Linux distributions, which each use different package managers and name packages in a different way. The following commands should cover the most common Linux distributions; feel free to suggest your favorite distribution if it is not covered.
Debian/Ubuntu family
sudo apt-get update \
&& sudo apt-get install build-essential ca-certificates curl \
libhdf5-dev libhwloc-dev libudev-dev pkg-config \
unzip util-linux wget
Fedora family
sudo dnf makecache --refresh \
&& sudo dnf group install c-development \
&& sudo dnf install ca-certificates curl \
hdf5-devel hwloc-devel libudev-devel pkg-config \
unzip util-linux wget
RHEL family
sudo dnf makecache --refresh \
&& sudo dnf groupinstall "Development Tools" \
&& sudo dnf install epel-release \
&& sudo /usr/bin/crb enable \
&& sudo dnf makecache --refresh \
&& sudo dnf install ca-certificates curl \
hdf5-devel hwloc-devel libudev-devel pkg-config \
unzip util-linux wget
Arch family
sudo pacman -Sy \
&& sudo pacman -S base-devel ca-certificates curl \
libhdf5 libhwloc pkg-config \
unzip util-linux wget
openSUSE family
sudo zypper ref \
&& sudo zypper in -t pattern devel_C_C++ \
&& sudo zypper in ca-certificates curl \
hdf5-devel hwloc-devel libudev-devel pkg-config \
unzip util-linux wget
Rust
To compile Rust code, you will need a Rust toolchain. You can install the one we need like this:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain none \
&& . "$HOME/.cargo/env" \
&& rustup toolchain install nightly-2024-10-28
If you want to use rust-analyzer, you may want to add a --component rust-analyzer flag to the rustup toolchain install command at the end. This will ensure that you get a rust-analyzer version that is fully compatible with the compiler version that we are going to use.
HDF5-to-PNG renderer
Throughout the second part of the course, we will be producing HDF5 files which contain time series of tabulated chemical concentrations. For quick and dirty debugging, it is convenient to render those tabulated concentrations into colorful PNG images.
To this end, you can use the data-to-pics program, which was developed as part of a previous version of this course. It can be installed as follows:
cargo install --git https://github.com/HadrienG2/grayscott.git data-to-pics
Environment test
Your Rust development environment should now be ready for this course’s practical work. I highly advise testing it by using the following shell script:
wget https://plmlab.math.cnrs.fr/grasland/numerical-rust-cpu/-/archive/solution/numerical-rust-cpu-solution.zip \
&& unzip numerical-rust-cpu-solution.zip \
&& rm numerical-rust-cpu-solution.zip \
&& cd numerical-rust-cpu-solution/exercises \
&& cargo run -- -n3 \
&& mkdir pics \
&& data-to-pics -o pics/ \
&& cd ../.. \
&& rm -rf numerical-rust-cpu-solution
It downloads, builds and runs the expected source code at the end of the last chapter of this course, then renders the associated images, and finally cleans up after itself by deleting everything.
macOS
Homebrew
To install CLI tools and libraries on macOS while retaining basic sanity, it is strongly recommended to use the Homebrew package manager. If you have not already installed it, it can be done like this:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
This will also prompt you to install the “Xcode command line tools” if you have not done so already. You will need those to build any software, including Rust software, so that’s a good thing.
CLI tools and libraries
Now that the basics are covered, we can install all needed packages as follows:
brew install curl hdf5 hwloc pkgconf unzip util-linux wget
You may get a warning stating that pkgconf can’t be installed because pkg-config is installed. This is harmless: they are two compatible implementations of the same functionality.
Rust
To compile Rust code, you will need a Rust toolchain. You can install the one we need like this:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain none \
&& . "$HOME/.cargo/env" \
&& rustup toolchain install nightly-2024-10-28
If you want to use rust-analyzer, you may want to add a --component rust-analyzer flag to the rustup toolchain install command at the end. This will ensure that you get a rust-analyzer version that is fully compatible with the compiler version that we are going to use.
HDF5-to-PNG renderer
Throughout the second part of the course, we will be producing HDF5 files which contain time series of tabulated chemical concentrations. For quick and dirty debugging, it is convenient to render those tabulated concentrations into colorful PNG images.
To this end, you can use the data-to-pics program, which was developed as part of a previous version of this course. It can be installed as follows:
cargo install --git https://github.com/HadrienG2/grayscott.git data-to-pics
Environment test
Your Rust development environment should now be ready for this course’s practical work. I highly advise testing it by using the following shell script:
wget https://plmlab.math.cnrs.fr/grasland/numerical-rust-cpu/-/archive/solution/numerical-rust-cpu-solution.zip \
&& unzip numerical-rust-cpu-solution.zip \
&& rm numerical-rust-cpu-solution.zip \
&& cd numerical-rust-cpu-solution/exercises \
&& cargo run -- -n3 \
&& mkdir pics \
&& data-to-pics -o pics/ \
&& cd ../.. \
&& rm -rf numerical-rust-cpu-solution
It downloads, builds and runs the expected source code at the end of the last chapter of this course, then renders the associated images, and finally cleans up after itself by deleting everything.
Windows
Native vs WSL
Windows is not the easiest OS to set up for native HPC software development.
If you like to use the Windows Subsystem for Linux, the easiest route will likely be for you to open a WSL shell and jump to the Linux instructions. By default WSL sets up an Ubuntu environment, where the Ubuntu family’s instructions should work.
The main thing that you’ll lose with respect to a native environment is the ability to easily use rust-analyzer in your code editor. You could get there by running a Linux editor (e.g. the Linux version of VSCode) inside of WSL, but the UI will likely have a non-native look and sub-par responsiveness.
If you really want a native (MSVC-based) Rust development environment, read on.
Rust and Visual Studio
The Windows system libraries and standard linker are distributed as part of the Visual Studio IDE1. It comes in several editions of very different pricing (from freeware to $250/month) and licensing terms, and it is generally speaking a bit of a hassle to set up.
However, learning by writing code during this tutorial qualifies as personal use, which means that you can use the free Community edition. And if you have not already set it up yourself, you can let the Rust toolchain installer automate all the installation steps for you, as a prelude to the installation of the Rust toolchain itself. All you will need to do, then, is wait for the various installation steps to complete. Installing Visual Studio will take a good while and may seem stuck at times, but be patient, it will eventually complete.
To get a Rust toolchain, and Visual Studio if you need it, just go to the Rust project’s “Getting started” page, download the rustup-init tool (you will probably want the 64-bit version unless your computer is very old), run it, and follow the instructions of the installation wizard.
Not to be confused with Visual Studio Code, which is just a code editor and does not contain the required tooling to build Windows software. There are some alternatives to Visual Studio for this purpose, like the GCC-based MinGW-w64, but using them makes you a second-class citizen of the Windows developer ecosystem, e.g. you lose the ability to use pre-built libraries and convenient tools like vcpkg.
git and vcpkg
In the dark ages of Windows software development, libraries were considered so hard to build that the normal way to use them was to download pre-built binaries from the author’s website.
This wasn’t great because different versions of the Visual Studio toolchain are ABI-incompatible with each other, which means that you can only reliably use a binary library with the Visual Studio version that has been used to compile it.
Library authors were thus expected to maintain binaries for every Visual Studio release since the dawn of the universe, and understandably didn’t. Which means that sooner or later you would end up wanting to use two libraries whose binaries had no Visual Studio release in common.
This problem is nowadays resolved by using vcpkg, which brings some of the greatness of Unix-style package management to Windows. It can build libraries automatically for your Visual Studio version, installing dependencies as needed, like e.g. Homebrew would on macOS.
Before you can install vcpkg, however, you will need to install the git version control software if you have not done so already. This can be done using the Windows installer from the official website. You will get asked many questions during installation, but if in doubt, you can just go with the default choices, which are quite reasonable.
Once git is installed, you can open a PowerShell terminal in the directory where you would like to install vcpkg (preferably a subdirectory of your user directory, so you don’t encounter permission issues and can update it easily) and run the following PowerShell script:
Invoke-Command -ScriptBlock {
$ErrorActionPreference="Stop"
git clone https://github.com/microsoft/vcpkg.git
Set-Location vcpkg
.\bootstrap-vcpkg.bat
}
To make the resulting vcpkg installation easier to use, it is a good idea to go to the environment variables settings panel (which you can easily find by typing “Variables” in the Windows search bar), and modify the user-specific environment variables as follows:
- Create a new VCPKG_ROOT variable that points to your vcpkg installation.
- Add %VCPKG_ROOT% to the Path list.
After this is done, you should be able to close all your PowerShell windows, open a new one, and type in vcpkg --help. You will then get a description of the vcpkg command-line options in return.
pkgconf, hwloc and hdf5
With vcpkg at hand, we can now easily install our remaining dependencies:
vcpkg install pkgconf hwloc hdf5
However, these dependencies end up somewhere deep inside vcpkg’s installation directory, where build scripts won’t find them. We need to tweak our user environment variables again to fix that:
- Add all of the following to the Path list:
  - %VCPKG_ROOT%\installed\x64-windows\bin
  - %VCPKG_ROOT%\installed\x64-windows\tools\pkgconf
- Add a new variable called PKG_CONFIG_PATH with the value %VCPKG_ROOT%\installed\x64-windows\lib\pkgconfig.
Again, after adjusting the variables, you will need to close all your PowerShell windows and open a new one for the environment changes to take effect.
Finally, vcpkg builds hwloc as hwloc.lib, whereas the generated pkgconfig file tells the linker to look for libhwloc.lib, and the same goes for hdf5.lib. In an ideal world, we would work around this with a symlink. But considering that symlinks are a privileged admin feature on Windows for reasons only known to Microsoft, a dumb copy will be a lot easier and almost as good:
Copy-Item $env:VCPKG_ROOT\installed\x64-windows\lib\hdf5.lib `
$env:VCPKG_ROOT\installed\x64-windows\lib\libhdf5.lib
Copy-Item $env:VCPKG_ROOT\installed\x64-windows\lib\hwloc.lib `
$env:VCPKG_ROOT\installed\x64-windows\lib\libhwloc.lib
And with this, your Rust build environment should be ready.
HDF5-to-PNG renderer
Throughout the second part of the course, we will be producing HDF5 files which contain time series of tabulated chemical concentrations. For quick and dirty debugging, it is convenient to render those tabulated concentrations into colorful PNG images.
To this end, you can use the data-to-pics program, which was developed as part of a previous version of this course. It can be installed as follows:
cargo install --git https://github.com/HadrienG2/grayscott.git data-to-pics
Environment test
Your Rust development environment should now be ready for this course’s practical work. I highly advise testing it by using the following PowerShell script:
Invoke-Command -ScriptBlock {
$ErrorActionPreference="Stop"
Invoke-WebRequest https://plmlab.math.cnrs.fr/grasland/numerical-rust-cpu/-/archive/solution/numerical-rust-cpu-solution.zip `
-OutFile solution.zip
Expand-Archive solution.zip -DestinationPath .
Remove-Item solution.zip
Set-Location numerical-rust-cpu-solution/exercises
cargo run -- -n3
New-Item -Name pics -ItemType directory
data-to-pics -o pics
Set-Location ..\..
Remove-Item numerical-rust-cpu-solution -Recurse
}
It downloads, builds and runs the expected source code at the end of the last chapter of this course, then renders the associated images, and finally cleans up after itself by deleting everything.
Training-day instructions
Expectations and conventions
Welcome to this practical about writing high performance computing in Rust!
This course assumes that the reader has basic familiarity with C (especially number types, arithmetic operations, string literals and stack vs heap). For the sake of concision, it will thus not explain concepts which are rigorously identical between Rust and C. If this does not describe you, feel free to ask the teacher about any surprising construct in the course’s material.
We will also compare Rust with C++ where they differ, so that readers familiar with C++ can get a good picture of Rust specificities. But previous knowledge of C++ should not be necessary to get a good understanding of Rust via this course.
Finally, we will make heavy use of the “C/++” abbreviation as a shorter alternative to “C and C++” when discussing common properties of C and C++, and how they compare to Rust.
Exercises source code
When you registered, you should have been directed to instructions for setting up your development environment. If you have not followed these instructions yet, now is the right time!
Now that the course has begun, we will download an up-to-date copy of the exercises’ source code and unpack it somewhere inside of your development environment. This will create a subdirectory called exercises/ in which we will be working during the rest of the course.
Please pick your environment below in order to get appropriate instructions:
Get a PowerShell terminal, then cd into the place where you would like to download the exercises’ source code and run the following script:
Invoke-Command -ScriptBlock {
$ErrorActionPreference="Stop"
if (Test-Path exercises) {
throw "ERROR: Please move or delete the existing 'exercises' subdirectory"
}
Invoke-WebRequest https://numerical-rust-cpu-d1379d.pages.math.cnrs.fr/setup/exercises.zip `
-OutFile exercises.zip
Expand-Archive exercises.zip -DestinationPath .
Remove-Item exercises.zip
Set-Location exercises
}
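If you are instead working in a Linux shell (a local installation, or a terminal inside the container), the same download can be done with the shell commands from the setup chapter:

if [ -e exercises ]; then
    echo "ERROR: Please move or delete the existing 'exercises' subdirectory"
else
    wget https://numerical-rust-cpu-d1379d.pages.math.cnrs.fr/setup/exercises.zip \
        && unzip exercises.zip \
        && rm exercises.zip
fi

…and then cd into the resulting exercises/ subdirectory.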
General advice
The exercises are based on code examples that are purposely incorrect.
Therefore, any code example within the provided exercises Rust project, except for 00-hello.rs, will either fail to compile or fail to run to completion. A TODO code comment or a … symbol will indicate where failures are expected, and your goal in the exercises will be to modify the code in such a way that the associated example compiles and runs. For runtime failures, you should not need to change the failing assertion; instead, you will need to change other code such that the assertion passes.
If you encounter any failure which does not seem expected, or if you otherwise get stuck, please call the trainer for guidance!
With that being said, let’s get started with actual Rust code. You can move to the next page, or any other page within the course for that matter, through the following means:
- Left and right keyboard arrow keys will switch to the previous/next page. Equivalently, arrow buttons will be displayed at the end of each page, doing the same thing.
- There is a menu on the left (not shown by default on small screen, use the top-left button to show it) that allows you to quickly jump to any page of the course. Note, however, that the course material is designed to be read in order.
- With the magnifying glass icon in the top-left corner, or the “S” keyboard shortcut, you can open a search bar that lets you look up content by keywords.
First steps
Welcome to Rust computing. This chapter will be a bit longer than the next ones, because we need to introduce a number of basic concepts that you will likely all need to do subsequent exercises. Please read on carefully!
Hello world
Following an ancient tradition, let us start by displaying some text on stdout:
fn main() {
    println!("Hello world!");
}
Notice the following:
- In Rust, function declarations start with the fn keyword. Returning a value is optional; if you do so, the return value’s type comes after the parameter list (as in fn myfunc() -> f32; see the sketch after this list).
- Like in C/++, the main function of a program is called main.
- It is legal to return nothing from main. Like return 0; in C/++, it indicates success.
- Sending a line of text to standard output can be done using the println!() macro. We’ll get back to why it is a macro later, but in a nutshell it allows controlling the formatting of values in a way that is similar to printf() in C and f-strings in Python.
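For instance, here is a small sketch (using the hypothetical myfunc name from the first bullet) of what a function that returns a value looks like:

fn myfunc() -> f32 {
    // The last expression of a function body, written without a trailing
    // semicolon, is the function's return value
    4.2
}

fn main() {
    println!("myfunc() returns {}", myfunc());
}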
Variables
Rust is in many ways a hybrid of programming languages from the C/++ family and the ML family (OCaml, Haskell). Following the latter’s tradition, it uses a variable declaration syntax that is very reminiscent of mathematical notation:
let x = 4;
As in mathematics, variables are immutable by default, so the following code does not compile:
let x = 4;
x += 2; // ERROR: Can't modify non-mut variable
If you want to modify a variable, you must make it mutable by adding a mut keyword after let:
let mut x = 4;
x += 2; // Fine, variable is declared mutable
This design nudges you towards using immutable variables for most things, as in mathematics, which tends to make code easier to reason about.
Alternatively, Rust allows for variable shadowing, so you are allowed to define a new variable with the same name as a previously declared variable, even if it has a different type:
let foo = 123;   // Here "foo" is an integer
let foo = "456"; // Now "foo" is a string and old integer "foo" is not reachable
This pattern is commonly used in scenarios like parsing where the old value should not be needed after the new one has been declared. It is otherwise a bit controversial, and can make code harder to read, so please don’t abuse this possibility.
Type inference
What gets inferred
Rust is a strongly typed language like C++, yet you may have noticed that the variable declarations above contain no types. That’s because the language supports type inference as a core feature: the compiler can automatically determine the type of variables based on various information.
- First, the value that is assigned to the variable may have an unambiguous type. For example, string literals in Rust are always of type &str (“reference-to-string”), so the compiler knows that the following variable must be of type &str:

  let s = "a string literal of type &str";

- Second, the way a variable is used after declaration may give its type away. If you use a variable in a context where a value of type T is expected, then that variable must be of type T.

  For example, Rust provides a heap-allocated variable-sized array type called Vec (similar to std::vector in C++), whose length is defined to be of type usize (similar to size_t in C/++). Therefore, if you use an integer as the length of a Vec, the compiler knows that this integer must be of type usize:

  let len = 7;            // This integer variable must be of type usize...
  let v = vec![4.2; len]; // ...because it's used as the length of a Vec here.
                          // (we'll introduce this vec![] macro later on)

- Finally, numeric literals can be annotated to force them to be of a specific type. For example, the literal 42i8 is an 8-bit signed integer, the literal 666u64 is a 64-bit unsigned integer, and the 12.34f32 literal is a 32-bit (“single precision”) IEEE-754 floating-point number. By this logic, the following variable x is a 16-bit unsigned integer:

  let x = 999u16;

If none of the above rules apply, then by default, integer literals will be inferred to be of type i32 (32-bit signed integer), and floating-point literals will be inferred to be of type f64 (double-precision floating-point number), as in C/++. This ensures that simple programs compile without requiring type annotations.

Unfortunately, this last fallback rule is not 100% reliable. There are a number of common code patterns that will not trigger it, typically involving some form of genericity.
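As a small illustration of this fallback rule (the variable names are made up):

let i = 128;     // No other constraint applies, so i is inferred to be an i32
let x = 6.02e23; // Likewise, x is inferred to be an f64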
What does not get inferred
There are cases where these three rules will not be enough to determine a variable’s type. This happens in the presence of generic types and functions.
Getting back to the Vec type, for example, it is actually a generic type Vec&lt;T&gt; where T can be almost any Rust type1. As with std::vector in C++, you can have a Vec&lt;i32&gt;, a Vec&lt;f32&gt;, or even a Vec&lt;MyStruct&gt; where MyStruct is a data structure that you have defined yourself.
This means that if you declare empty vectors like this…
// The following syntaxes are strictly equivalent. Neither compiles. See below.
let empty_v1 = Vec::new();
let empty_v2 = vec![];
…the compiler has no way to know what kind of Vec you are dealing with. This cannot be allowed because the properties of a generic type like Vec&lt;T&gt; heavily depend on what concrete T parameter it’s instantiated with (e.g. Vec equality is only defined when the inner data type has an equality operator). Therefore the above code does not compile.
In that case, you can enforce a specific variable type using type ascription:
// The following syntaxes are almost equivalent.
// In both cases, the compiler knows this is a Vec<bool> because we said so.
let empty_vb1: Vec<bool> = Vec::new(); // Specify the type of empty_vb1 directly
let empty_vb2 = Vec::<bool>::new();    // Specify the type of Vec we are creating
…with the exception of dynamic-sized types, an advanced topic which we cannot afford to cover during this short course. Ask the teacher if you really want to know ;)
Inferring most things is the idiom
If you are coming from another programming language where type inference is either not provided, or very hard to reason about as in C++, you may be tempted to use type ascription to give an explicit type to every single variable. But I would advise resisting this temptation for a few reasons:
- Rust type inference rules are much simpler than those of C++. It only takes a small amount of time to “get them in your head”, and once you do, you will get more concise code that is less focused on pleasing the type system and more on performing the task at hand.
- Doing so is the idiomatic style in the Rust ecosystem. If you don’t follow it, your code will look odd to other Rust developers, and you will have a harder time reading code written by other Rust developers.
- If you have any question about inferred types in your program, Rust comes with excellent IDE support via rust-analyzer, so it is easy to configure your code editor to display inferred types, either all the time or on mouse hover.
But of course, there are limits to the “infer everything” approach. If every single type in the program were inferred, then a small change somewhere in the implementation of your program could non-locally change the type of many other variables in the program, or even in client code, resulting in accidental API breakages, as commonly seen in dynamically typed programming languages.
For this reason, Rust will not let you use type inference in entities that may appear in public APIs, like function signatures or struct declarations. This means that in Rust code, type inference will be restricted to the boundaries of a single function’s implementation. Which makes it more reliable and easier to reason about, as long as you do not write huge functions.
Back to println!()
With variable declarations out of the way, let’s go back to our hello world example and investigate the Rust text formatting macro in more detail.
Remember that at the beginning of this chapter, we wrote this hello world statement:
#![allow(unused)] fn main() { println!("Hello world!"); }
This program called the println!() macro with a string literal as an argument, which resulted in that string being written to the program’s standard output, followed by a line feed.
If all we could pass to println!() was a string literal, however, it wouldn’t need to be a macro. It would just be a regular function.
But like f-strings in Python, println!() provides a variety of text formatting operations, accessible via curly braces. For example, you can interpolate variables from the outer scope…
let x = 4;
// Will print "x is 4"
println!("x is {x}");
…pass extra arguments in a style similar to printf in C…
#![allow(unused)] fn main() { let s = "string"; println!("s is {}", s); }
…and name arguments so that they are easier to identify in complex usage.
#![allow(unused)] fn main() { println!("x is {x} and y is {y}. Their sum is {x} + {y} = {sum}", x = 4, y = 5, sum = 4 + 5); }
You can also control how these arguments are converted to strings, using a mini-language that is described in the documentation of the std::fmt module from the standard library.
For example, you can enforce that floating-point numbers are displayed with a certain number of decimal digits:
let x = 4.2;
// Will print "x is 4.200000"
println!("x is {x:.6}");
println!() is part of a larger family of text formatting and text I/O macros that includes…
- print!(), which differs from println!() by not adding a trailing newline at the end. Beware that since stdout is line-buffered, this will result in no visible output until the next println!(), unless the text that you are printing contains the \n line feed character.
- eprint!() and eprintln!(), which work like print!() and println!() but write their output to stderr instead of stdout.
- write!() and writeln!(), which take a byte or text output stream2 as an extra argument and write the specified text there. This is the same idea as fprintf() in C.
- format!(), which does not write the output string to any I/O stream, but instead builds a heap-allocated String containing it for later use.
All of these macros use the same format string mini-language as println!(), although their semantics differ. For example, write!() takes an extra output stream argument, and returns a Result to account for the possibility of I/O errors. Since these errors are rare on stdout and stderr, they are just treated as fatal by the print!() family, keeping the code that uses them simpler.
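To make these differences concrete, here is a small illustrative sketch (the variable names are made up) combining a few of these macros:

use std::fmt::Write as _; // Bring the fmt::Write trait's methods into scope

// format!() builds the String "pi is roughly 3.14" without printing anything
let s = format!("pi is roughly {:.2}", 3.14159);
// eprintln!() sends it to stderr instead of stdout
eprintln!("{s}");

// write!() appends formatted text to an output stream (here, a String) and
// returns a Result, unlike the print!() family
let mut buffer = String::new();
write!(buffer, "x = {}", 42).unwrap();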
In Rust’s abstraction vocabulary, text can be written to implementations of one of the std::io::Write and std::fmt::Write traits. We will discuss traits much later in this course, but for now you can think of a trait as a set of functions and properties that is shared by multiple types, allowing for type-generic code. The distinction between io::Write and fmt::Write is that io::Write is byte-oriented while fmt::Write is text-oriented. We need this distinction because not every byte stream is a valid UTF-8 text stream.
From Display to Debug
So far, we have been printing simple numerical types. What they have in common is that there is a single, universally agreed upon way to display them, modulo formatting options. So the Rust standard library can afford to incorporate this display logic into its stability guarantees.
But some other types are in a muddier situation. For example, take the Vec dynamically-sized array type. You may think that something like “[1, 2, 3, 4, 5]” would be a valid way to display an array containing the numbers from 1 to 5. But what happens when the array contains billions of numbers? Should we attempt to display all of them, drowning the user’s terminal in endless noise and slowing down the application to a crawl? Or should we summarize the display in some way, like numpy and pandas do in Python?
There is no single right answer to this kind of question, and attempting to account for all use cases would bloat up Rust’s text formatting mini-language very quickly. So instead, Rust does not provide a standard text display for these types, and therefore the following code does not compile:
// ERROR: Type Vec does not implement Display
println!("{}", vec![1, 2, 3, 4, 5]);
All this is fine and good, but we all know that in real-world programming, it is very convenient during program debugging to have a way to exhaustively display the contents of a variable. Unlike C++, Rust acknowledges this need by distinguishing two different ways to translate a typed value to text:
- The Display trait provides, for a limited set of types, an “official” value-to-text translation logic that should be fairly consensual, general-purpose, suitable for end-user consumption, and can be covered by library stability guarantees.
- The Debug trait provides, for almost every type, a dumber value-to-text translation logic which prints out the entire contents of the variable, including things which are considered implementation details and subject to change. It is purely meant for developer use, and showing debug strings to end users is somewhat frowned upon, although they are tolerated in developer-targeted output like logs or error messages.
As you may guess, although Vec does not implement the Display operation, it does implement Debug, and in the mini-language of println!() et al, you can access this alternate Debug logic using the ? formatting option:
#![allow(unused)] fn main() { println!("{:?}", vec![1, 2, 3, 4, 5]); }
As a more interesting example, strings implement both Display and Debug. The Display variant displays the text as-is, while the Debug logic displays it as you would type it in a program, with quotes around it and escape sequences for things like line feeds and tab stops:
#![allow(unused)] fn main() { let s = "This is a \"string\".\nWell, actually more of an str."; println!("String display: {s}"); println!("String debug: {s:?}"); }
Both Display and Debug additionally support an alternate display mode, accessible via the # formatting option. For composite types like Vec, this has the effect of switching to a multi-line display (one line feed after each inner value), which can be more readable for complex values:
let v = vec![1, 2, 3, 4, 5];
println!("Normal debug: {v:?}");
println!("Alternate debug: {v:#?}");
For simpler types like integers, this may have no effect. It’s ultimately up to implementations of Display and Debug to decide what this formatting option does, although staying close to the standard library’s convention is obviously strongly advised.
Finally, for an even smoother printout-based debugging experience, you can use the dbg!() macro. It takes an expression as input and prints out…
- Where you are in the program (source file, line, column).
- The expression that was passed to dbg!(), in un-evaluated form (e.g. dbg!(2 * x) would literally print “2 * x”).
- The result of evaluating this expression, with alternate debug formatting.
…then the result is re-emitted as an output, so that the program can keep using it. This makes it easy to annotate existing code with dbg!() macros, with minimal disruption:
let y = 3 * dbg!(4 + 2);
println!("{y}");
Assertions
With debug formatting covered, we are now ready to cover the last major component of the exercises, namely assertions.
Rust has an assert!() macro which can be used similarly to the C/++ macro of the same name: to make sure that a condition that should always be true if the code is correct is indeed true. If the condition is not true, the thread will panic. This is a process analogous to throwing a C++ exception, which in simple cases will just kill the program.
assert!(1 == 1); // This will have no user-visible effect
assert!(2 == 3); // This will trigger a panic, crashing the program
There are, however, a fair number of differences between C/++ and Rust assertions:
- Although well-intentioned, the C/++ practice of only checking assertions in debug builds has proven to be treacherous in practice. Therefore, most Rust assertions are checked in all builds. When the runtime cost of checking an assertion in release builds proves unacceptable, you can use debug_assert!() instead, for assertions which are only checked in debug builds.
- Rust assertions do not abort the process in the default compiler configuration. Cleanup code will still run, so e.g. files and network connections will be closed properly, reducing system state corruption in the event of a crash. Also, unlike unhandled C++ exceptions, Rust panics make it trivial to get a stack trace at the point of assertion failure, by setting the RUST_BACKTRACE environment variable to 1.
- Rust assertions allow you to customize the error message that is displayed in the event of a failure, using the same formatting mini-language as println!():

  let sonny = "dead";
  assert!(
      sonny == "alive",
      "Look how they massacred my boy :( Why is Sonny {}?",
      sonny
  );
Finally, one common case for wanting custom error messages in C++ is when checking two variables for equality. If they are not equal, you will usually want to know their actual values. In Rust, this is natively supported by the assert_eq!() and assert_ne!() macros, which respectively check that two values are equal or not equal.
If the comparison fails, the two values being compared will be printed out with Debug formatting.
assert_eq!(2 + 2, 5, "Don't you dare make Big Brother angry! >:(");
Exercise
Now, go to your code editor, open the examples/01-let-it-be.rs source file inside of the provided exercises/ source tree, and address the TODOs in it.
Once you are done, the code should compile and run successfully. To check this, you may use the following command in your development environment’s terminal:
cargo run --example 01-let-it-be
Numbers
Since this is a numerical computing course, a large share of the material will be dedicated to the manipulation of numbers (especially floating-point ones). It is therefore essential that you get a good grasp of how numerical data works in Rust. Which is the purpose of this chapter.
Primitive types
We have previously mentioned some of Rust’s primitive numerical types. Here is the current list:
- u8, u16, u32, u64 and u128 are fixed-size unsigned integer types. The number indicates their storage width in bits.
- usize is an unsigned integer type suitable for storing the size of an object in memory. Its size varies from one computer to another: it is 64 bits wide on most computers, but can be as narrow as 16 bits on some embedded platforms.
- i8, i16, i32, i64, i128 and isize are signed versions of the above integer types.
- f32 and f64 are the single-precision and double-precision IEEE-754 floating-point types.
This list is likely to slowly expand in the future; for example, there are proposals for adding f16 and f128 to it (representing the IEEE-754 half-precision and quad-precision floating-point types, respectively). But for now, these types can only be manipulated via third-party libraries.
Literals
As you have seen, Rust’s integer and floating-point literals look very similar to those of C/++. There are a few minor differences, for example the quality-of-life feature to put some space between digits of large numbers uses the _ character instead of '…
#![allow(unused)] fn main() { println!("A big number: {}", 123_456_789); }
…but the major difference, by far, is that literals are not typed in Rust. Their type is almost always inferred based on the context in which they are used. And therefore…
- In Rust, you rarely need typing suffixes to prevent the compiler from truncating your large integers, as is the norm in C/++.
- Performance traps linked to floating point literals being treated as double precision, in situations when you actually wanted single precision computations, are less common in Rust.
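As a small, hypothetical illustration of this second point:

let x = 1.5f32;
let y = x + 2.5; // 2.5 is inferred to be an f32 here, so no suffix and no
                 // accidental double-precision detour is needed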
Part of the reason why type inference works so well in Rust is that unlike C/++, Rust has no implicit conversions.
Conversions
In C/++, every time one performs arithmetic or assigns values to variables, the compiler will silently insert conversions between number types as needed to get the code to compile. This is problematic for two reasons:
- “Narrowing” conversions from types with many bits to types with few bits can lose important information, and thus produce wrong results.
- “Promoting” conversions from types with few bits to types with many bits can result in computations being performed with excessive precision, at a performance cost, only for the hard-earned extra result bits to be discarded during the final variable affectation step.
If we were to nonetheless apply this notion in a Rust context, there would be a third Rust-specific problem, which is that implicit conversions would break the type inference of numerical literals in all but the simplest cases. If you can pass variables of any numerical types to functions accepting any other numerical type, then the compiler’s type inference cannot know what is the numerical literal type that you actually intended to use. This would greatly limit type inference effectiveness.
For all these reasons, Rust does not allow for implicit type conversions. A variable of type i8 can only accept values of type i8, a variable of type f32 can only accept values of type f32, and so on.
If you want C-style conversions, the simplest way is to use as casts:
let x = 4.2f32 as i32;
As many Rust programmers were unhappy with the lossy nature of these casts, fancier conversions with stronger guarantees (e.g. only work if no information is lost, report an error if overflow occurs) have slowly been made available. But we probably won’t have the time to cover them in this course.
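For reference, here is a hedged sketch of what one such checked conversion looks like in today’s standard library (not needed for the exercises):

// i8::try_from() only succeeds when no information would be lost
let ok = i8::try_from(42i32);   // Ok(42): the value fits in an i8
let err = i8::try_from(300i32); // Err(...): 300 overflows an i8
assert!(ok.is_ok() && err.is_err());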
Arithmetic
The syntax of Rust arithmetic is generally speaking very similar to that of C/++, with a few minor exceptions like ! replacing ~ for integer bitwise NOT. But the rules for actually using these operators are quite different.
For the same reason that implicit conversions are not supported, mixed arithmetic between multiple numerical types is rarely supported in Rust. This will often be a pain point for people used to the C/++ way, as it means that classic C numerical expressions like 4.2 / 2 are invalid and will not compile. Instead, you will need to get used to writing 4.2 / 2.0.
On the flip side, Rust tries harder than C/++ to handle incorrect arithmetic operations in a sensible manner. In C/++, two basic strategies are used:
- Some operations, like overflowing unsigned integers or assigning the 123456 literal to an 8-bit integer variable, silently produce results that violate mathematical intuition.
- Other operations, like overflowing signed integers or casting floating-point NaNs and infinities to integers, result in undefined behavior. This gives the compiler and CPU license to trash your entire program (not just the function that contains the faulty instruction) in unpredictable ways.
As you may guess by the fact that signed and unsigned integer operations are treated differently, it is quite hard to guess which strategy is being used, even though one is obviously a lot more dangerous than the other.
But due to the performance impact of checking for arithmetic errors at runtime, Rust cannot systematically do so and remain performance-competitive with C/++. So a distinction is made between debug and release builds:
- In debug builds, invalid arithmetic stops the program using panics. Rust panics are similar to C++ exceptions, except you are not encouraged to recover from them. They are meant to stop buggy code before it does more damage, not to report “expected” error conditions.
- In release builds, invalid Rust arithmetic silently produces wrong results, but it never causes undefined behavior.
As one size does not fit all, individual integer and floating-point types also provide methods which re-implement the arithmetic operators with different semantics. For example, the saturating_add() method of integer types handles addition overflow and underflow by returning the maximal or minimal value of the integer type of interest, respectively:
#![allow(unused)] fn main() { println!("{}", 42u8.saturating_add(240)); // Prints 255 println!("{}", (-40i8).saturating_add(-100)); // Prints -128 }
Methods
In Rust, unlike in C++, any type can have methods, not just class-like types. As a result, most of the mathematical functions that are provided as free functions in the C and C++ mathematical libraries are provided as methods of the corresponding types in Rust:
let x = 1.2f32;
let y = 3.4f32;
let basic_hypot = (x.powi(2) + y.powi(2)).sqrt();
Depending on which operation you are looking at, the effectiveness of this design choice varies. On one hand, it works great for operations which are normally written on the right-hand side in mathematics, like raising a number to a certain power. And it allows you to access mathematical operations with fewer module imports. On the other hand, it looks decidedly odd and Java-like for operations which are normally written in prefix notation in mathematics, like sin() and cos().
If you have a hard time getting used to it, note that prefix notation can be quite easily implemented as a library, see for example prefix-num-ops.
The set of operations that Rust provides on primitive types is also a fair bit broader than that provided by C/++, covering many operations which are traditionally only available via compiler intrinsics or third-party libraries in other languages. Although to C++’s credit, it must be said that this situation has actually improved in newer standard revisions.
To know which operations are available via methods, just check the documentation page for the associated arithmetic types.
Exercise
Now, go to your code editor, open the examples/02-numerology.rs source file, and address the TODOs in it. The code should compile and run successfully at the end.
To check this, you may use the following command in your development environment’s terminal:
cargo run --example 02-numerology
Loops and arrays
As a result of this course being time-constrained, we do not have the luxury of deep-diving into Rust possibilities like a full Rust course would. Instead, we will be focusing on the minimal set of building blocks that you need in order to do numerical computing in Rust.
We’ve covered variables and basic debugging tools in the first chapter, and we’ve covered integer and floating-point arithmetic in the second chapter. Now it’s time for the last major language-level component of numerical computations: loops, arrays, and other iterable constructs.
Range-based loop
The basic syntax for looping over a range of integers is simple enough:
for i in 0..10 {
    println!("i = {i}");
}
Following an old tradition, ranges based on the .. syntax are left inclusive and right exclusive, i.e. the left element is included, but the right element is not. The reasons why this is a good default have been explained at length elsewhere, so we will not repeat them here.
However, Rust acknowledges that ranges that are inclusive on both sides also have their uses, and therefore they are available through a slightly more verbose syntax:
#![allow(unused)] fn main() { println!("Fortran and Julia fans, rejoice!"); for i in 1..=10 { println!("i = {i}"); } }
The Rust range types are actually used for more than iteration. They accept non-integer bounds, and they provide a contains() method to check that a value is contained within a range. And all combinations of inclusive, exclusive, and infinite bounds are supported by the language, even though not all of them can be used for iteration (a short sketch follows this list):
- The .. infinite range contains all elements in some ordered set
- x.. ranges start at a certain value and contain all subsequent values in the set
- ..y and ..=y ranges start at the smallest value of the set and contain all values up to an exclusive or inclusive upper bound
- The Bound standard library type can be used to cover all other combinations of inclusive, exclusive, and infinite bounds, via (Bound, Bound) tuples
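Here is a short sketch of these non-iteration uses (the specific bounds are arbitrary):

// contains() works with non-integer bounds too
assert!((1.0..2.0).contains(&1.5));
assert!((0..=10).contains(&10));

// A (Bound, Bound) tuple covers combinations that have no dedicated syntax,
// such as this exclusive-below, inclusive-above range
use std::ops::{Bound, RangeBounds};
let range = (Bound::Excluded(0), Bound::Included(10));
assert!(!range.contains(&0));
assert!(range.contains(&10));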
Iterators
Under the hood, the Rust for loop has no special support for ranges of integers. Instead, it operates over a pair of lower-level standard library primitives called Iterator and IntoIterator. These can be described as follows (a short example of driving an iterator by hand follows the list):
- A type that implements the Iterator trait provides a next() method, which produces a value and internally modifies the iterator object so that a different value will be produced when the next() method is called again. After a while, a special None value is produced, indicating that all available values have been produced, and the iterator should not be used again.
- A type that implements the IntoIterator trait “contains” one or more values, and provides an into_iter() method which can be used to create an Iterator that yields those inner values.
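As a quick illustration of this protocol, here is what driving an iterator by hand looks like, using a simple integer range:

let mut iterator = (0..2).into_iter();
assert_eq!(iterator.next(), Some(0));
assert_eq!(iterator.next(), Some(1));
assert_eq!(iterator.next(), None); // The iterator is now exhausted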
The for loop uses these mechanisms as follows:
fn do_something(i: i32) {}

// A for loop like this...
for i in 0..3 {
    do_something(i);
}

// ...is effectively translated into this during compilation:
let mut iterator = (0..3).into_iter();
while let Some(i) = iterator.next() {
    do_something(i);
}
Readers familiar with C++ will notice that this is somewhat similar to STL iterators and C++11 range-based for loops, but with a major difference: unlike Rust iterators, C++ iterators have no knowledge of the end of the underlying data stream. That information must be queried separately, carried around throughout the code, and if you fail to handle it correctly, undefined behavior will ensue.
This difference comes at a major usability cost, to the point where after much debate, 5 years after the release of the first stable Rust version, the C++20 standard revision has finally decided to soft-deprecate standard C++ iterators in favor of a Rust-like iterator abstraction, confusingly calling it a “range” because the “iterator” name was already taken.1
Another advantage of the Rust iterator model is that because Rust iterator objects are self-sufficient, they can implement methods that transform an iterator object in various ways. The Rust Iterator trait heavily leverages this possibility, providing dozens of methods that are automatically implemented for every standard and user-defined iterator type, even though the default implementations can be overridden for performance.
Most of these methods consume the input iterator and produce a different iterator as an output. These methods are commonly called “adapters”. Here is an example of one of them in action:
// Turn an integer range into an iterator, then transform the iterator so that
// it only yields 1 element for 10 original elements.
for i in (0..100).into_iter().step_by(10) {
    println!("i = {i}");
}
One major property of these iterator adapters is that they operate lazily: transformations are performed on the fly as new iterator elements are generated, without needing to collect transformed data in intermediary collections. Because compilers are bad at optimizing out memory allocations and data movement, this way of operating is a lot better than generating temporary collections from a performance point of view, to the point where code that uses iterator adapters usually compiles down to the same assembly as an optimal hand-written while loop.
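To make the lazy evaluation point more concrete, here is a small sketch reusing the adapter introduced above (the range and step are arbitrary): building the adapter chain performs no work and materializes no collection, elements are only produced as they are requested.

// Nothing is computed on this line: the adapter chain is just a description...
let lazy = (0..1_000_000_000).step_by(100_000_000);

// ...each element is only produced here, one at a time, as the loop asks for it,
// and no billion-element collection is ever allocated
for i in lazy {
    println!("i = {i}");
}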
For reasons that will be explained during the next parts of this course, usage of iterator adapters is very common in idiomatic Rust code, and generally preferred over equivalent imperative programming constructs unless the latter provide a significant improvement in code readability.
Arrays and Vecs
It is not just integer ranges that can be iterated over. Two other iterable Rust objects of major interest to numerical computing are arrays and Vecs. They are very similar to std::array and std::vector in C++:
- The storage for array variables is fully allocated on the stack.2 In contrast, the storage for a Vec's data is allocated on the heap, using the Rust equivalent of malloc() and free().
- The size of an array must be known at compile time and cannot change during runtime. In contrast, it is possible to add and remove elements to a Vec, and the underlying backing store will be automatically resized through memory reallocations and copies to accommodate this.
- It is often a bad idea to create and manipulate large arrays because they can overflow the program's stack (resulting in a crash) and are expensive to move around. In contrast, Vecs will easily scale as far as available RAM can take you, but they are more expensive to create and destroy, and accessing their contents may require an extra pointer indirection.
- Because of the compile-time size constraint, arrays are generally less ergonomic to manipulate than Vecs. Therefore Vec should be your first choice unless you have a good motivation for using arrays (typically heap allocation avoidance).
There are three basic ways to create a Rust array…
- Directly provide the value of each element: [1, 2, 3, 4, 5].
- State that all elements have the same value, and how many elements there are: [42; 6] is the same as [42, 42, 42, 42, 42, 42].
- Use the std::array::from_fn standard library function to initialize each element based on its position within the array (a short example follows below).
…and Vecs support the first two initialization methods via the vec! macro, which uses the same syntax as array literals:
let v = vec![987; 12];
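For the third method, here is a quick sketch of std::array::from_fn in action, with an arbitrarily chosen per-element formula:

// Each element is computed from its index
let squares: [usize; 5] = std::array::from_fn(|i| i * i);
assert_eq!(squares, [0, 1, 4, 9, 16]);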
However, there is no equivalent of std::array::from_fn for Vec, as it is replaced by the superior ability to construct Vecs from either iterators or C++-style imperative code:
// The following three declarations are largely equivalent.

// Here, we need to tell the compiler that we're building a Vec, but we can let
// it infer the inner data type.
let v1: Vec<_> = (123..456).into_iter().collect();
let v2 = (123..456).into_iter().collect::<Vec<_>>();

// The right-exclusive range 123..456 contains 456 - 123 elements.
let mut v3 = Vec::with_capacity(456 - 123);
for i in 123..456 {
    v3.push(i);
}

assert_eq!(v1, v2);
assert_eq!(v1, v3);
In the code above, the Vec::with_capacity constructor plays the same role as the reserve() method of C++'s std::vector: it lets you tell the Vec implementation how many elements you expect to push() upfront, so that said implementation can allocate a buffer of the right length from the beginning and thus avoid later reallocations and memory movement on push().
And as hinted at the beginning of this section, both arrays and Vecs implement IntoIterator, so you can iterate over their elements:
for elem in [1, 3, 5, 7] {
    println!("{elem}");
}
Indexing
Following the design of most modern programming languages, Rust lets you access array elements by passing a zero-based integer index in square brackets:
let arr = [9, 8, 5, 4];
assert_eq!(arr[2], 5);
However, unlike in C/++, accessing arrays at an invalid index does not result in undefined behavior that gives the compiler license to arbitrarily trash your program. Instead, the thread will just deterministically panic, which by default will result in a well-controlled program crash.
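For illustration, here is a short sketch of the failure mode, along with the non-panicking get() alternative:

let arr = [9, 8, 5, 4];

// Indexing out of bounds panics at runtime instead of corrupting memory:
// accessing index 42 would abort the thread with
// "index out of bounds: the len is 4 but the index is 42"

// The get() method is a non-panicking alternative that returns an Option
assert_eq!(arr.get(2), Some(&5));
assert_eq!(arr.get(42), None);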
Unfortunately, this memory safety does not come for free. The compiler has to insert bounds-checking code, which may or may not later be removed by its optimizer. When they are not optimized out, these bound checks tend to make array indexing a fair bit more expensive from a performance point of view in Rust than in C/++.
And this is actually one of the many reasons to prefer iteration over manual array and Vec indexing in Rust. Because iterators access array elements using a predictable and known-valid pattern, they can work without bound checks. Therefore, they can be used to achieve C/++-like performance, without relying on fallible compiler optimizations or unsafe code in your program.3 And another major benefit is obviously that you cannot crash your program by using iterators wrong.
But for those cases where you do need some manual indexing, you will likely enjoy the enumerate() iterator adapter, which gives each iterator element an integer index that starts at 0 and keeps growing. It is a very convenient tool for bridging the iterator world with the manual indexing world:
// Later in the course, you will learn a better way of doing this
let v1 = vec![1, 2, 3, 4];
let v2 = vec![5, 6, 7, 8];
for (idx, elem) in v1.into_iter().enumerate() {
    println!("v1[{idx}] is {elem}");
    println!("v2[{idx}] is {}", v2[idx]);
}
Slicing
Sometimes, you need to extract not just one array element, but a subset of array elements. For example, in the Gray-Scott computation that we will be working on later on in the course, you will need to work on sets of 3 consecutive elements from an input array.
The simplest tool that Rust provides you to deal with this situation is slices, which can be built using the following syntax:
let a = [1, 2, 3, 4, 5];
let s = &a[1..4];
assert_eq!(s, [2, 3, 4]);
Notice the leading &, which means that we take a reference to the original data (we'll get back to what this means in a later chapter), and the use of integer ranges to represent the set of array indices that we want to extract.
If this reminds you of C++20's std::span, this is no coincidence. Spans are another instance of C++20 trying to catch up with features that Rust v1 had 5 years earlier…
Manual slice extraction comes with the same pitfalls as manual indexing (costly bound checks, crash on error…), so Rust provides more efficient slice iterators. The most popular ones, both demonstrated in the short sketch after this list, are…
- chunks() and chunks_exact(), which cut up your array/vec into a set of consecutive slices of a certain length and provide an iterator over these slices.
  - For example, chunks(2) would yield elements at indices 0..2, 2..4, 4..6, etc.
  - They differ in how they handle the remaining elements of the array after all regularly-sized chunks have been taken care of. chunks_exact() compiles down to more efficient code by making you handle trailing elements using a separate code path.
- windows(), where the iterator yields overlapping slices, each shifted one array/vec element away from the previous one.
  - For example, windows(2) would yield elements at indices 0..2, 1..3, 2..4, etc.
  - This is exactly the iteration pattern that we need for discrete convolution, which the Gray-Scott reaction computation that we'll study later is an instance of.
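Here is a small sketch of both iterators in action on an arbitrary 5-element array:

let data = [1, 2, 3, 4, 5];

// chunks(2) yields non-overlapping slices, with a shorter trailing chunk
for chunk in data.chunks(2) {
    println!("{chunk:?}"); // Prints [1, 2], then [3, 4], then [5]
}

// windows(3) yields overlapping slices, shifted by one element each time
for window in data.windows(3) {
    println!("{window:?}"); // Prints [1, 2, 3], then [2, 3, 4], then [3, 4, 5]
}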
All these methods are not just restricted to arrays and Vecs: you can just as well apply them to slices, because they are actually methods of the slice type to begin with. It just happens that Rust, through some compiler magic,4 allows you to call slice type methods on arrays and Vecs, as if they were the equivalent all-encompassing &v[..] slice.
Therefore, whenever you are using arrays and Vecs, the documentation of the slice type is also worth keeping around, which is why the official documentation conveniently copies it into the documentation of the array and Vec types.
Exercise
Now, go to your code editor, open the examples/03-looping.rs source file, and address the TODOs in it. The code should compile and run successfully at the end.
To check this, you may use the following command in your development environment’s terminal:
cargo run --example 03-looping
It may be worth pointing out that replacing a major standard library abstraction like this in a mature programming language is not a very wise move. 4 years after the release of C++20, range support in the standard library of major C++ compilers is still either missing or very immature, and support in third-party C++ libraries is basically nonexistent. Ultimately, C++ developers will unfortunately be the ones paying the price of this standards committee decision, by needing to live with codebases that confusingly mix and match STL iterators and ranges for many decades to come. This is just one little example, among many others, of why attempting to iteratively change C++ in the hope of getting it to the point where it matches the ergonomics of Rust is ultimately a futile evolutionary dead-end that does the C++ community more harm than good…
When arrays are used as e.g. struct members, they are allocated inline, so for example an array within a heap-allocated struct is part of the same allocation as the hosting struct.
Iterators are themselves implemented using unsafe, but that's the standard library maintainers' problem to deal with, not yours.
Cough cough Deref trait cough cough.
Squaring
As you could see if you did the previous set of exercises, we have already covered enough Rust to start doing some actually useful computations.
There is still one important building block that we are missing to make the most of Rust iterator adapters, however, and that is anonymous functions, also known as lambda functions or lexical closures in programming language theory circles.
In this chapter, we will introduce this language feature, and show how it can be used, along with Rust traits and the higher-order function pattern, to compute the square of every element of an array using fully idiomatic Rust code.
Meet the lambda
In Rust, you can define a function anywhere, including inside of another function1. Parameter types are specified using the same syntax as variable type ascription, and return types can be specified at the end after a -> arrow sign:
fn outsourcing(x: u8, y: u8) {
    fn sum(a: u8, b: u8) -> u8 {
        // Unlike in C/++, return statements are unnecessary in simple cases.
        // Just write what you want to return as a trailing expression.
        a + b
    }
    println!("{}", sum(x, y));
}
However, Rust is not Python, and inner function definitions cannot capture variables from the scope of outer function definitions. In other words, the following code does not compile:
fn outsourcing(x: u8, y: u8) {
    fn sum() -> u8 {
        // ERROR: There are no "x" and "y" variables in this scope.
        x + y
    }
    println!("{}", sum());
}
Rust provides a slightly different abstraction for this, namely anonymous functions aka lambdas aka closures. In addition to being able to capture surrounding variables, these also come with much lighter-weight syntax for simple use cases…
fn outsourcing(x: u8, y: u8) {
    // Notice that the "-> u8" return type is inferred.
    // If you have parameters, their type is also inferred.
    let sum = || x + y;
    println!("{}", sum());
}
…while still supporting the same level of type annotation sophistication as full function declarations, should you need it to guide type inference or improve clarity:
fn outsourcing(x: u8, y: u8) {
    let sum = |a: u8, b: u8| -> u8 { a + b };
    println!("{}", sum(x, y));
}
The main use case for lambda functions, however, is interaction with higher-order functions: functions that take other functions as inputs and/or return other functions as output.
A glimpse of Rust traits
We have touched upon the notion of traits several times before in this course, without taking the time to really explain it. That's because Rust traits are a complex topic, which we do not have the luxury of covering in depth in this short course.
But now that we are getting to higher-order functions, we are going to need to interact a little bit more with Rust traits, so this is a good time to expand a bit more on what Rust traits do.
Traits are the cornerstone of Rust’s genericity and polymorphism system. They let you define a common protocol for interacting with several different types in a homogeneous way. If you are familiar with C++, traits in Rust can be used to replace any of these C++ features:
- Virtual methods and overrides
- Templates and C++20 concepts, with first-class support for the “type trait” pattern
- Function and method overloading
- Implicit conversions
The main advantage of having one single complex general-purpose language feature like this, instead of many simpler narrow-purpose features, is that you do not need to deal with interactions between the narrow-purpose features. As C++ practitioners know, these interactions can result in quite surprising behavior, and getting their combination right is a very subtle art.
Another practical advantage is that you will less often hit a complexity wall, where you hit the limits of the particular language feature that you were using and must rewrite large chunks of code in terms of a completely different language feature.
Finally, Rust traits let you do things that are impossible in C++. Such as adding methods to third-party types, or verifying that generic code follows its intended API contract.
If you are a C++ practitioner and just started thinking "hold on, weren't C++20 concepts supposed to fix this generics API contract problem?", please click on the arrow for a full explanation.
Let us assume that you are writing a generic function and claim that it works with any type that has an addition operator. The Rust trait system will check that this is indeed the case as the generic code is compiled. Should you use any other type operation like, say, the multiplication operator, the compiler will error out during the compilation of the generic code, and ask you to either add this operation to the generic function’s API contract or remove it from its implementation.
In contrast, C++20 concepts only let you check that the type parameters that generic code is instantiated with match the advertised contract. Therefore, in our example scenario, the use of C++20 concepts will be ineffective, and the compilation of the generic code will succeed in spite of the stated API contract being incorrect.
It is only later, as someone tries to use your code with a type that has an addition operator but no multiplication operator (like, say, a linear algebra vector type that does not use operator* for the dot product), that an error will be produced deep inside of the implementation of the generic code. The error will point at the use of the multiplication operator by the implementation of the generic code, which may be only remotely related to what the user is trying to do with your library, as your function may be a small implementation detail of a much bigger functionality. It may thus take users some mental gymnastics to figure out what's going on. This is part of why templates have a bad ergonomics reputation in C++, the other part being that function overloading as a programming language feature is fundamentally incompatible with good compiler error messages.
And sadly this error is unlikely to be caught during your testing, because generic code can only be tested by instantiating it with specific types. As an author of generic code, you are unlikely to think about types with an addition operator but no multiplication operator, since these are relatively rare in programming.
To summarize, unlike C++20 concepts, Rust traits are actually effective at making unclear compiler error messages deep inside of the implementation of generic code a thing of the past. They do not merely work under the unrealistic expectation that authors of generic code are perfectly careful to type in the right concepts in the signature of generic code, and to keep the unchecked concept annotations up to date as the generic code's implementation evolves2.
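To make the first point more concrete, here is a hedged sketch of the kind of contract checking that the Rust compiler performs (the function and its trait bounds are made up for illustration):

use std::ops::Add;

// Declared API contract: T only needs to support addition (and copying)
fn double<T: Add<Output = T> + Copy>(x: T) -> T {
    x + x
    // Writing `x * x` here instead would be rejected while this generic code is
    // being compiled, because multiplication is not part of the declared bounds
}

assert_eq!(double(21), 42);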
Higher order functions
One of Rust's most important traits is the Fn trait, which is implemented for types that can be called like a function. It also has a few cousins that we will cover later on.
Thanks to special treatment by the compiler3, the Fn trait is actually a family of traits that can be written like a function signature, without parameter names. So for example, an object of a type that implements the Fn(i16, f32) -> usize trait can be called like a function, with two parameters of type i16 and f32, and the call will return a result of type usize.
You can write a generic function that accepts any object of such a type like this…
fn outsourcing(op: impl Fn(i16, f32) -> usize) {
    println!("The result is {}", op(42, 4.2));
}
…and it will accept any matching callable object, including both regular functions and closures:
fn outsourcing(op: impl Fn(i16, f32) -> usize) {
    println!("The result is {}", op(42, 6.66));
}

// Example with a regular function
fn my_op(x: i16, y: f32) -> usize {
    (x as f32 + 1.23 * y) as usize
}
outsourcing(my_op);

// Example with a closure
outsourcing(|x, y| {
    println!("x may be {x}, y may be {y}, but there is only one answer");
    42
});
As you can see, closures shine in this role by keeping the syntax lean and the code more focused on the task at hand. Their ability to capture environment can also be very powerful in this situation, as we will see in later chapters.
You can also use the impl Trait syntax as the return type of a function, in order to state that you are returning an object of some type that implements a certain trait, without spelling out what the concrete type is.
This is especially useful when working with closures, because the type of a closure object is a compiler-internal secret that cannot be named by the programmer:
/// Returns a function object with the signature that we have seen so far
fn make_op() -> impl Fn(i16, f32) -> usize {
    |x, y| (x as f32 + 1.23 * y) as usize
}
By combining these two features, Rust programmers can very easily implement any higher-order function that takes a function as a parameter or returns a function as a result. And because the code of these higher-order functions is specialized for the specific function type that you're dealing with at compile time, runtime performance can be much better than when using dynamically dispatched higher-order function abstractions in other languages, like std::function in C++4.
Squaring numbers at last
The Iterator trait provides a number of methods that are actually higher-order functions. The simplest of them is the map method, which consumes the input iterator, takes a user-provided function, and produces an output iterator whose elements are the result of applying the user-provided function to each element of the input iterator:
let numbers = [1.2f32, 3.4, 5.6];
let squares = numbers.into_iter()
                     .map(|x| x.powi(2))
                     .collect::<Vec<_>>();
println!("{numbers:?} squared is {squares:?}");
And thanks to good language design and heroic optimization work by the Rust compiler team, the result will be just as fast as hand-optimized assembly for all but the smallest input sizes5.
Exercise
Now, go to your code editor, open the examples/04-square-one.rs source file, and address the TODOs in it. The code should compile and run successfully at the end.
To check this, you may use the following command in your development environment’s terminal:
cargo run --example 04-square-one
This reflects a more general Rust design choice of letting almost anything be declared almost anywhere: for example, Rust will happily let you declare types inside of functions, or even inside of value expressions.
You may think that this is another instance of the C++ standardization committee painstakingly putting together a bad clone of a Rust feature as an attempt to play catch-up 5 years after the first stable release of Rust. But that is actually not the case. C++ concepts have been in development for more than 10 years, and were a major contemporary inspiration for the development of Rust traits, along with Haskell's typeclasses. However, the politically dysfunctional C++ standardization committee failed to reach an agreement on the original vision, and had to heavily descope it before they succeeded at getting the feature out of the door in C++20. In contrast, Rust easily succeeded at integrating a much more ambitious generics API contract system into the language. This highlights once again the challenges of integrating major changes into an established programming language, and why the C++ standardization committee might actually better serve C++ practitioners by embracing the “refine and polish” strategy of its C and Fortran counterparts.
There are a number of language entities like this that get special treatment by the Rust compiler. This is done as a pragmatic alternative to spending more time designing a general version that could be used by library authors, which would only get the feature into the hands of Rust developers much later. The long-term goal is to reduce the number of these exceptions over time, in order to give library authors more power and reduce the amount of functionality that can only be implemented inside of the standard library.
Of course, there is no free lunch in programming, and all this compile-time specialization comes at a cost. As with C++ templates, the compiler must effectively recompile higher-order functions that take functions as a parameter for each input function that they're called with. This will result in compilation taking longer, consuming more RAM, and producing larger output binaries. If this is a problem and the runtime performance gains are not useful to your use case, you can use dyn Trait instead of impl Trait to switch to dynamic dispatch, which works much like C++ virtual methods. But that is beyond the scope of this short course.
To handle arrays smaller than about 100 elements optimally, you will need to specialize the code for the input size (which is possible but beyond the scope of this course) and make sure the input data is optimally aligned in memory for SIMD processing (which we will cover later on).
Hadamard
So far, we have been iterating over a single array/Vec/slice at a time. And this already took us through a few basic computation patterns. But we must not forget that with the tools introduced so far, jointly iterating over two Vecs at the same time still involves some pretty ugly code:
let v1 = vec![1, 2, 3, 4];
let v2 = vec![5, 6, 7, 8];
for (idx, elem) in v1.into_iter().enumerate() {
    println!("v1[{idx}] is {elem}");
    // Ew, manual indexing with slow bound checks and panic risks... :(
    println!("v2[{idx}] is {}", v2[idx]);
}
Thankfully, there is an easy fix called zip(). Let's use it to implement the Hadamard product!
Combining iterators with zip()
Iterators come with an adapter method called zip(). Like for loops, this method expects an object of a type that implements IntoIterator. What it does is to consume the input iterator, turn the user-provided object into an iterator, and return a new iterator that yields pairs of elements from both the original iterator and the user-provided iterable object:
let v1 = vec![1, 2, 3, 4];
let v2 = vec![5, 6, 7, 8];
// Iteration will yield (1, 5), (2, 6), (3, 7) and (4, 8)
for (x1, x2) in v1.into_iter().zip(v2) {
    println!("Got {x1} from v1 and {x2} from v2");
}
Now, if you have used the zip() method of any other programming language for numerics before, you should have two burning questions:
- What happens if the two input iterators are of different length?
- Is this really as performant as a manual indexing-based loop?
To the first question, other programming languages have come up with three typical answers:
- Stop when the shortest iterator stops, ignoring remaining elements of the other iterator.
- Treat this as a usage error and report it using some error handling mechanism.
- Make it undefined behavior and give the compiler license to randomly trash the program.
As you may guess, Rust did not pick the third option. It could reasonably have picked the second option, but it opted for the first one instead. This was likely done because error handling comes at a runtime performance cost that was not felt to be acceptable for this common performance-sensitive operation, where user error is rare. But should you need it, option 2 can easily be built as a third-party library, and is indeed available via the popular itertools crate.
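To illustrate the chosen behavior, here is a small sketch with two inputs of different lengths:

let short = [1, 2];
let long = [10, 20, 30, 40];

// Iteration stops as soon as the shorter input runs out:
// this prints (1, 10) and (2, 20), then stops
for pair in short.into_iter().zip(long) {
    println!("{pair:?}");
}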
Speaking of performance, Rust's zip() is, perhaps surprisingly, usually just as good as a hand-tuned indexing loop1. It does not exhibit the runtime performance issues that you would face when using C++20's range-zipping operations2. And it will often be highly superior to manual indexing code, which comes with a risk of panics and makes you rely on the black magic of compiler optimizations to remove indexing-associated bound checks. Therefore, you are strongly encouraged to use zip() liberally in your code!
Hadamard product
One simple use of zip() is to implement the Hadamard vector product.
This is one of several different kinds of products that you can use in linear algebra. It works by taking two vectors of the same dimensionality as input, and producing a third vector of the same dimensionality, which contains the pairwise products of elements from both input vectors:
fn hadamard(v1: Vec<f32>, v2: Vec<f32>) -> Vec<f32> {
    assert_eq!(v1.len(), v2.len());
    v1.into_iter().zip(v2)
      .map(|(x1, x2)| x1 * x2)
      .collect()
}

assert_eq!(
    hadamard(vec![1.2, 3.4, 5.6], vec![9.8, 7.6, 5.4]),
    [1.2 * 9.8, 3.4 * 7.6, 5.6 * 5.4]
);
Exercise
Now, go to your code editor, open the examples/05-hadamantium.rs source file, and address the TODOs in it. The code should compile and run successfully at the end.
To check this, you may use the following command in your development environment’s terminal:
cargo run --example 05-hadamantium
This is not to say that the hand-tuned indexing loop is itself perfect. It will inevitably suffer from runtime performance issues caused by suboptimal data alignment. But we will discuss how to solve this problem and achieve optimal Hadamard product performance after we cover data reductions, which suffer from much more severe runtime performance problems that include this one among many others.
The most likely reason why this is the case is that Rust pragmatically opted to make tuples a primitive language type that gets special support from the compiler, which in turn allows the Rust compiler to give its LLVM backend very strong hints about how code should be generated (e.g. pass tuple elements via CPU registers, not stack pushes and pops that may or may not be optimized out by later passes). On its side, the C++ standardization committee did not do this because they cared more about keeping std::tuple a library-defined type that any sufficiently motivated programmer could re-implement on their own given an infinite amount of spare time. This is another example, if need be, that even though both the C++ and Rust communities care a lot about giving maximal power to library writers and minimizing the special nature of their respective standard libraries, it is important to mentally balance the benefits of this against the immense short-term efficiency of letting the compiler treat a few types and functions specially. As always, programming language design is all about tradeoffs.
Sum and dot
All the computations that we have discussed so far are, in the jargon of SIMD programming, vertical operations. For each element of the input arrays, they produce zero or one matching output array element. From the perspective of software performance, these vertical operations are the best-case scenario, and compilers know how to produce very efficient code out of them without much assistance from the programmer. Which is why we have not discussed performance much so far.
But in this chapter, we will now switch our focus to horizontal operations, also known as reductions. We will see why these operations are much more challenging to compiler optimizers, and then the next few chapters will cover what programmers can do to make them more efficient.
Summing numbers
One of the simplest reduction operations that one can do with an array of floating-point numbers is to compute the sum of the numbers. Because this is a common operation, Rust iterators make it very easy by providing a dedicated method for it:
let sum = [1.2, 3.4, 5.6, 7.8].into_iter().sum::<f32>();
The only surprising thing here might be the need to spell out the type of the sum. This need does not arise because the compiler does not know the type of the array elements that we are summing: that could be handled by the default “unknown floats are f64” fallback.
The problem is instead that the sum() method is generic, in order to be able to work with both numbers and references to numbers (which we have not covered yet; for now you can think of them as being like C pointers). And wherever there is genericity in Rust, there is loss of type inference.
Dot product
Once we have zip(), map() and sum(), it takes only very little work to combine them in order to implement a simple Euclidean dot product:
let x = [1.2f32, 3.4, 5.6, 7.8];
let y = [9.8, 7.6, 5.4, 3.2];
let dot = x.into_iter().zip(y)
           .map(|(x, y)| x * y)
           .sum::<f32>();
Hardware-minded people, however, may know that we are leaving some performance and floating-point precision on the table by doing it like this. Modern CPUs come with fused multiply-add operations that are as costly as an addition or multiplication, and do not round between the multiplication and addition which results in better output precision.
Rust exposes these hardware operations1, and we can use them by switching to a more general cousin of sum() called fold():
let x = [1.2f32, 3.4, 5.6, 7.8];
let y = [9.8, 7.6, 5.4, 3.2];
let dot = x.into_iter().zip(y)
           .fold(0.0, |acc, (x, y)| x.mul_add(y, acc));
Fold works by initializing an accumulator variable with a user-provided value, and then going through each element of the input iterator and integrating it into the accumulator, each time producing an updated accumulator. In other words, the above code is equivalent to this:
let x = [1.2f32, 3.4, 5.6, 7.8];
let y = [9.8, 7.6, 5.4, 3.2];
let mut acc = 0.0;
for (x, y) in x.into_iter().zip(y) {
    acc = x.mul_add(y, acc);
}
let dot = acc;
And indeed, the iterator fold() method optimizes just as well as the above imperative code, resulting in identical generated machine code. The problem is that, unfortunately, this imperative code itself is not ideal and will result in very poor computational performance. We will now show how to quantify this problem, and later explain why it happens and what you can do about it.
Setting up criterion
The need for criterion
With modern hardware, compilers and operating systems, measuring the performance of short-running code snippets has become a fine art that requires a fair amount of care.
Simply surrounding code with OS timer calls and subtracting the readings may have worked well many decades ago. But nowadays it is often necessary to use specialized tooling that leverages repeated measurements and statistical analysis, in order to get stable performance numbers that truly reflect the code’s performance and are reproducible from one execution to another.
The Rust compiler has built-in tools for this, but unfortunately they are not fully exposed by stable versions at the time of writing, as there is a longstanding desire to clean up and rework some of the associated APIs before exposing them for broader use. As a result, third-party libraries should be used for now. In this course, we will mainly be using criterion, as it is by far the most mature and popular option available to Rust developers today.2
Adding criterion to a project
Because criterion is based on the Rust compiler's benchmarking infrastructure, but cannot fully use it as it is not completely stabilized yet, it does unfortunately require a fair bit of unclean setup. First you must add a dependency on criterion in your Rust project.
Because this course has been designed to work on HPC centers without Internet access on worker nodes, we have already done this for you in the example source code. But for your information, it is done using the following command:
cargo add --dev criterion
cargo add is an easy-to-use tool for editing your Rust project's Cargo.toml configuration file in order to register a dependency on (by default) the latest version of some library. With the --dev option, we specify that this dependency will only be used during development, and should not be included in production builds, which is the right thing to do for a benchmarking harness.
After this, every time we add a benchmark to the application, we will need to manually edit the Cargo.toml configuration file in order to add an entry that disables the Rust compiler's built-in benchmark harness. This is done so that it does not interfere with criterion's work by erroring out on criterion-specific CLI benchmark options that it does not expect. The associated Cargo.toml configuration file entry looks like this:
[[bench]]
name = "my_benchmark"
harness = false
Unfortunately, this is not yet enough, because benchmarks can be declared pretty much anywhere in Rust. So we must additionally disable the compiler's built-in benchmark harness on every other binary defined by the project. For a simple library project that defines no extra binaries, this extra Cargo.toml configuration entry should do it:
[lib]
bench = false
It is only after we have done all of this setup that we can get criterion benchmarks that will reliably accept our CLI arguments, no matter how they were started.
A benchmark skeleton
Now that our Rust project is set up properly for benchmarking, we can start
writing a benchmark. First you need to create a benches
directory in
your project if it does not already exists, and create a source file there,
named however you like with a .rs
extension.
Then you must add the criterion boilerplate to this source file, which is partially automated using macros, in order to get a runnable benchmark that integrates with the standard cargo bench tool…
use criterion::{black_box, criterion_group, criterion_main, Criterion};
pub fn criterion_benchmark(c: &mut Criterion) {
// This is just an example benchmark that you can freely delete
c.bench_function("sqrt 4.2", |b| b.iter(|| black_box(4.2).sqrt()));
}
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
…and finally you must add the aforementioned Cargo.toml boilerplate so that criterion CLI arguments keep working as expected. Assuming you unimaginatively named your benchmark source file “benchmark.rs”, this would be…
[[bench]]
name = "benchmark"
harness = false
Writing a good microbenchmark
There are a few basic rules that you should always follow whenever you are writing a microbenchmark that measures the performance of a small function in your code, if you do not want the compiler’s optimizer to transform your benchmark in unrealistic ways:
- Any input value that is known to the compiler's optimizer can be used to tune the code specifically for this input value, and sometimes even to reduce the benchmarking loop to a single iteration by leveraging the fact that the computation always operates on the same input. Therefore, you must always hide inputs from the compiler's optimizer using an optimization barrier such as criterion's black_box().
- Any output value that is not used in any way is a useless computation in the eyes of the compiler's optimizer. Therefore, the compiler's optimizer will attempt to delete any code that is involved in the computation of such values. To avoid this, you will again want to feed results into an optimization barrier like criterion's black_box(). criterion implicitly does this for any output value that you return from the iter() API's callback.
It’s not just about the compiler’s optimizer though. Hardware and operating systems can also leverage the regular nature of microbenchmarks to optimize performance in unrealistic ways, for example CPUs will exhibit an unusually good cache hit rate when running benchmarks that always operate on the same input values. This is not something that you can guard against, just a pitfall that you need to keep in mind when interpreting benchmark results: absolute timings are usually an overly optimistic estimate of your application’s performance, and therefore the most interesting output of microbenchmarks is actually not the raw result but the relative variations of this result when you change the code that is being benchmarked.
Finally, on a more fundamental level, you must understand that on modern hardware, performance usually depends on problem size in a highly nonlinear and non-obvious manner. It is therefore a good idea to test the performance of your functions over a wide range of problem sizes. Geometric sequences of problem sizes like 1, 2, 4, 8, 16, 32, … are often a good default choice.
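As an illustration of that last piece of advice, here is a hedged sketch of what sweeping over problem sizes could look like using criterion's benchmark groups (the group name, problem sizes, and summed computation are arbitrary choices made for this example):

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

pub fn criterion_benchmark(c: &mut Criterion) {
    // One benchmark group, with one measurement per problem size
    let mut group = c.benchmark_group("sum_f32");
    for size in [1, 2, 4, 8, 16, 32, 64] {
        let input = vec![1.0f32; size];
        group.bench_with_input(BenchmarkId::from_parameter(size), &input, |b, v| {
            // black_box() hides the input from the optimizer on every iteration
            b.iter(|| black_box(v).iter().sum::<f32>())
        });
    }
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);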
Exercise
Due to Rust benchmark harness design choices, the exercise for this chapter will, for once, not take place in the examples subdirectory of the exercises' source tree.
Instead, you will mainly work on the benches/06-summit.rs source file, which is a Criterion benchmark that was created using the procedure described above. Implement the sum() function within this benchmark to make it sum the elements of its input Vec, then run the benchmark with cargo bench --bench 06-summit.
To correctly interpret the results, you should know that a single core of a modern x86 CPU, with a 2 GHz clock rate and AVX SIMD instructions, can perform 32 billion f32 sums per second.3
Technically, at this stage of our performance optimization journey, we can only use these operations via a costly libm function call. This happens because not all x86_64 CPUs support the fma instruction family, and the compiler has to be conservative in order to produce code that runs everywhere. Later, we will see how to leverage modern hardware better using the multiversion crate.
Although criterion would be my recommendation today due to its feature-completeness and popularity, divan is quickly shaping up into an interesting alternative that might become the recommendation in a future edition of this course. Its benefits over criterion include significantly improved API ergonomics, faster measurements, and better support for code that is generic over compile-time quantities aka “const generics”.
Recent Intel CPUs (as of 2024) introduced the ability to perform a third SIMD sum per clock cycle, which bumps the theoretical limit to 48 billion f32 sums per second per 2 GHz CPU core, or even double that on those few CPUs that support AVX-512 with native-width ALUs.
Ownership
After running the summing benchmark from the last chapter, you should have observed that its performance is quite bad. Half a nanosecond per vector element may not sound like much, but when we’re dealing with CPUs that can process tens of multiplications in that time, it’s already something to be ashamed of.
One reason why this happens is that our benchmark does not just sum floats as it should: it also generates a full Vec of them on every iteration. That's not a desirable feature, as it shifts benchmark numbers away from what we are trying to measure. So in this chapter we will study the ownership and borrowing features of Rust, which will let us reuse input vectors and stop doing this.
Some historical context
RAII in Rust and C++
Rust relies on the Resource Acquisition Is Initialization (RAII) pattern in order to automatically manage system resources like heap-allocated memory. This pattern was originally introduced by C++, and for those unfamiliar with it, here is a quick summary of how it works:
- In modern structured programming languages, variables are owned by a certain code scope. Once you exit this scope, the variable cannot be named or otherwise accessed anymore. Therefore, the state of the variable has become unobservable to the programmer, and compilers and libraries should be able to do arbitrary things to it without meaningfully affecting the observable behavior of the program.
- Library authors can leverage this scoping structure by defining destructor functions1, which are called when a variable goes out of scope. These functions are used to clean up all system state associated with the variable. For example, a variable that manages a heap allocation would use it to deallocate the associated, now-unreachable heap memory.
Move semantics and its problems in C++
One thing which historical C++ did not provide, however, was an efficient way to move a resource from one scope to another. For example, returning RAII types from functions could only be made efficient through copious amounts of brittle compiler optimizer black magic. This was an undesirable state of affairs, so after experimenting with several bad solutions including std::auto_ptr, the C++ standardization committee finally came up with a reasonable fix in C++11, called move semantics.
The basic idea of move semantics is that it should be possible to transfer ownership of one system resource from one variable to another. The original variable would lose access to the resource, and give up on trying to liberate it once its scope ends. While the new variable would gain access to the resource, and become the one responsible for liberating it. Since the two variables involved can be in different scopes, this could be used to resolve the function return problem, among others.
Unfortunately, C++ move semantics were also heavily overengineered and bolted onto a 26-year-old programming language standard, with the massive user base and the backwards compatibility constraints that come with it. As a result, the design was made so complicated and impenetrable that even today, few C++ developers will claim to fully understand how it works. One especially bad design decision, in retrospect, was the choice to make move semantics an opt-in feature that each user-defined type had to individually add support for. Predictably, few types in the ecosystem did, and as a result C++ move semantics have mostly become an experts-only feature that you will be very lucky to see working as advertised on anything but toy code examples.
How Rust fixed move semantics
By virtue of being first released 4 years after C++11, Rust could learn from these mistakes, and embrace a complete redesign of C++ move semantics that is easier to understand and use, safer, and reliably works as intended. More specifically, Rust move semantics improved upon C++11 by leveraging the following insights:
- C++11 move semantics further exacerbated the dangling pointer memory safety problems that had been plaguing C++ for a long time, by adding a new dangling variable problem in the form of moved-from variables. There was a pressing need to find a solution to this new problem, which could ideally also address the original problem.
- Almost every use case of C++ move constructors and move assignment operators could be covered by memcpy()-ing the bytes of the original variable into the new variable and ensuring that 1/ the original variable cannot be used anymore and 2/ its destructor will never run. By restricting the scope of move operations like this, they could be automatically and universally implemented for every Rust type without any programmer intervention.
- For types which do not manage resources, restricting access to the original variable would be overkill, and keeping it accessible after the memory copy is fine. The two copies are independent and each one can freely go its separate way.
Moving and copying
In the general case, using a Rust value moves it. Once you have assigned the value of a variable to another one, passed it to a function, or used it in any other way, you cannot use the original variable anymore. In other words, the following code snippets are all illegal in Rust:
// Suppose you have a Vec defined like this...
let v = vec![1.2, 3.4, 5.6, 7.8];

// You cannot use it after assigning it to another variable...
let v2 = v;
println!("{v:?}"); // ERROR: v has been moved from

// ...and you cannot use it after passing it to a function
fn f(v: Vec<f32>) {}
f(v2);
println!("{v2:?}"); // ERROR: v2 has been moved from
Some value types, however, escape these restrictions by virtue of being safe to memcpy(). For example, stack-allocated arrays do not manage heap-allocated memory or any other system resource, so they are safe to copy:
let a = [1.2, 3.4, 5.6, 7.8];

let a2 = a;
println!("{a:?}"); // Fine, a2 is an independent copy of a

fn f(a: [f32; 4]) {}
f(a2);
println!("{a2:?}"); // Fine, f received an independent copy of a2
Types that use this alternate logic can be identified by the fact that they implement the Copy trait. Other types which can be copied, but not via a simple memcpy(), must use the explicit .clone() operation from the Clone trait instead. This ensures that expensive operations like heap allocations stand out in the code, eliminating a classic performance pitfall of C++.2
let v = vec![1.2, 3.4, 5.6, 7.8];

let v2 = v.clone();
println!("{v:?}"); // Fine, v2 is an independent copy of v

fn f(v: Vec<f32>) {}
f(v2.clone());
println!("{v2:?}"); // Fine, f received an independent copy of v2
But of course, this is not good enough for our needs. In our motivating benchmarking example, we do not want to simply replace our benchmark input re-creation loop with a benchmark input copying loop, we want to remove the copy as well. For this, we will need references and borrowing.
References and borrowing
The pointer problem
It has been said in the past that every problem in programming can be solved by adding another layer of indirection, except for the problem of having too many layers of indirection.
Although this quote is commonly invoked when discussing API design, one has to wonder if the original author had programming language pointers and references in mind, given how well the quote applies to them. Reasoned use of pointers can enormously benefit a codebase, for example by avoiding unnecessary data movement. But if you overuse pointers, your code will rapidly turn into a slow and incomprehensible mess of pointer spaghetti.
Many attempts have been made to improve upon this situation, with interest increasing as the rise of multithreading kept making things worse. Functional programming and communicating sequential processes are probably the two most famous examples. But most of these formal models came with very strong restrictions on how programs could be written, making each of them a poor choice for a large amount of applications that did not “fit the model”.
It can be argued that Rust is the most successful take at this problem from the 2010s, by virtue of managing to build a larger ecosystem over two simple but very far-reaching sanity rules:
- Any user-reachable reference must be safe to dereference
- Almost3 all memory can be either shared in read-only mode or accessible for writing at any point in time, but never both at the same time.
Shared references
In Rust, shared references are created by applying the ampersand & operator to values. They are called shared references because they enable multiple variables to share access to the same target:
// You can create as many shared references as you like...
let v = vec![1.2, 3.4, 5.6, 7.8];
let rv1 = &v;
let rv2 = rv1; // ...and they obviously implement Copy

// Reading from all of them is fine
println!("{v:?} {rv1:?} {rv2:?}");
If this syntax reminds you of how we extracted slices from arrays and vectors before, this is not a coincidence. A slice is a kind of Rust reference.
By the rules given above, a Rust reference cannot outlive the variable that it points to. And a variable that has at least one reference pointing to it cannot be modified, moved, or go out of scope. More precisely, doing any of these things will invalidate the reference, so it is not possible to use the reference after this happens.
// This common C++ mistake is illegal in Rust: References can't exit data scope
fn dangling() -> &f32 {
    let x = 1.23;
    &x
}

// Mutating shared data invalidates the references, so this is illegal too
let mut data = 666;
let r = &data;
data = 123;
println!("{r}"); // ERROR: r has been invalidated
As a result, for the entire useful lifetime of a reference, its owner can assume that the reference’s target is valid and does not change. This is a very useful invariant to operate under when applying optimizations like caching to your programs.
Mutable references
Shared references are normally3 read-only. You can read from them via either the * dereference operator or the method call syntax that implicitly calls it, but you cannot overwrite their target. For that, you will need mutable &mut references:
let mut x = 123;
let rx = &mut x;
*rx = 666;
Shared and mutable references operate like a compiler-verified reader-writer lock: at any point in time, data may be either accessible for writing by one code path or accessible for reading by any number of code paths.
An obvious loophole would be to access memory via the original variable. But much like shared references are invalidated when the original variable is written to, mutable references are invalidated when the original variable is either written to or read from. Therefore, code which has access to a mutable reference can assume that as long as the reference is valid, reads and writes which are made through it are not observable by any other code path or thread of execution.
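As a short sketch of this invalidation rule (the commented-out line is the one the compiler would reject):

let mut x = 123;
let rx = &mut x;

// Reading x here would invalidate rx before its last use, so this is rejected:
// println!("{x}");

*rx = 666;        // Last use of rx...
println!("{x}");  // ...so reading x afterwards is fine again (prints 666)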
This prevents Rust code from getting into situations like NumPy in Python, where modifying a variable in one place of the program can unexpectedly affect readouts from memory made by another code path thousands of lines of code away:
import numpy as np
a = np.array([1, 2, 3, 4])
b = a
# Much later, in a completely unrelated part of the program
b[0] = 0 # Will change the value of the first item of the original "a" variable
What does this give us?
In addition to generally making code a lot easier to reason about by preventing programmers from going wild with pointers, Rust references prevent many common C/++ pointer usage errors that result in undefined behavior, including but not limited to:
- Null pointers
- Dangling pointers
- Misaligned pointers
- Iterator invalidation
- Data races between concurrently executing threads
Furthermore, the many type-level guarantees of references are exposed to the compiler's optimizer, which can leverage them to speed up the code under the assumption that forbidden things do not happen. This means that, for example, there is no need for C's restrict keyword in Rust: almost3 every Rust reference has restrict-like semantics without you needing to ask for it.
Finally, a little-known benefit of Rust's shared XOR mutable data aliasing model is that it closely matches the hardware-level MESI coherence protocol of CPU caches, which means that code which idiomatically follows the Rust aliasing model tends to exhibit better multi-threading performance, with fewer cache ping-pong problems.
At what cost?
The main drawback of Rust’s approach is that even though it is much more flexible than many previous attempts at making pointers easier to reason about, many existing code patterns from other programming languages still do not translate nicely to it.
So libraries designed for other programming languages (like GUI libraries) may be hard to use from Rust, and all novice Rust programmers inevitably go through a learning phase colloquially known as "fighting the borrow checker", where they keep trying to do things that are against the rules before they fully internalize them.
A further-reaching consequence is that many language and library entities need to exist in up to three different versions in order to allow working with owned values, shared references, and mutable references. For example, the Fn trait actually has two cousins called FnMut and FnOnce:
- The Fn trait we have used so far takes a shared reference to the input function object. Therefore, it cannot handle closures that can mutate internal state, and this code is illegal:

  fn call_fn(f: impl Fn()) {
      f()
  }

  let mut state = 42;
  call_fn(|| state = 43); // ERROR: Can't call a state-mutating closure via Fn

  The flip side to this is that the implementation of call_fn() only needs a shared reference to f to call it, which gives it maximal flexibility.
- The FnMut trait takes a mutable reference to the input function object. So it can handle the above closure, but now a mutable reference will be needed to call it, which is more restrictive.

  fn call_fn_mut(mut f: impl FnMut()) {
      f()
  }

  let mut state = 42;
  call_fn_mut(|| state = 43); // Ok, FnMut can handle state-mutating closures

- The FnOnce trait consumes the function object by value. Therefore, a function that implements this trait can only be called once. In exchange, there is even more flexibility on input functions; for example, returning an owned value from the function is legal:

  fn call_fn_once(f: impl FnOnce() -> Vec<i16>) -> Vec<i16> {
      f()
  }

  let v = vec![1, 2, 3, 4, 5];
  call_fn_once(|| v); // Ok, closure can move away v since it's only called once
Similarly, we actually have not one, but up to three ways to iterate over the
elements of Rust collections, depending on if you want owned values
(into_iter()
), shared references (iter()
), or mutable references
(iter_mut()
). And into_iter()
itself is a bit more complicated than this
because if you call it on a shared reference to a collection, it will yield
shared references to elements, and if you call it on a mutable reference to a
collection, it will yield mutable references to elements.
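As a quick illustrative sketch (plain standard library code, nothing specific to this course), the three iteration flavors look like this:

let mut v = vec![1, 2, 3];
for x in v.iter() {       // x is a shared reference (&i32)
    println!("{x}");
}
for x in v.iter_mut() {   // x is a mutable reference (&mut i32)
    *x += 1;
}
for x in v.into_iter() {  // x is an owned value (i32); v cannot be used afterwards
    println!("{x}");
}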
And there is much more to this, such as the move keyword that can be used to force a closure to capture state by value when it would normally capture it by reference, allowing said closure to be easily sent to a different thread of execution… Suffice it to say, the value/&/&mut dichotomy runs deep into the Rust API vocabulary and affects many aspects of the language and ecosystem.
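To give you a rough idea, here is a minimal sketch of move in action, using a standard library thread (which requires the closure to own everything it captures):

let data = vec![1.2f32, 3.4, 5.6];
// Without "move", the closure would only borrow "data", and the compiler
// would reject sending it to a thread that may outlive the current function
let handle = std::thread::spawn(move || data.iter().sum::<f32>());
let sum = handle.join().expect("the worker thread should not panic");
println!("{sum}");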
References and functions
Rust’s reference and borrowing rules interact with functions in interesting ways. There are a few easy cases that you can quickly learn:
- The function only takes references as input. This requires no special precautions, since references are guaranteed to be valid for the entire duration of their existence.
- The function takes only one reference as input, and returns one or more references as output. The compiler will infer by default that the output references probably come from the input data, which is almost always true.
- The function returns references out of nowhere. This is only valid when returning references to global data; in any other case, you should thank the Rust borrow checker for catching your dangling pointer bug.
The first two cases are handled by simply replicating the reference syntax in function parameter and return types, without any extra annotation…
fn forward_input_to_output(x: &i32) -> &i32 {
    x
}
…and the third case must be annotated with the 'static lifetime to advertise the fact that only global state belongs here:
fn global_ref() -> &'static f32 {
    &std::f32::consts::PI
}
But as soon as a function takes multiple references as input and returns one reference as output, you need4 to specify which input(s) the output reference can come from, as this affects how other code can use your function. Rust handles this need via lifetime annotations, which look like this:
fn output_tied_to_x<'a>(x: &'a i32, y: &f32) -> &'a i32 {
    x
}
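To see what this annotation buys the calling code, here is a small sketch: because the output borrow is only tied to x, the caller remains free to modify y while the output reference is alive.

let x = 1;
let mut y = 2.0f32;
let tied_to_x = output_tied_to_x(&x, &y);
y += 1.0;                // Fine: the output reference does not borrow from y
println!("{tied_to_x}"); // The output reference is still usable here
// Modifying x instead would be rejected while tied_to_x is alive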
Lifetime annotations as a language concept can take a fair amount of time to
master, so my advice to you as a beginner would be to avoid running into them at
the beginning of your Rust journey, even if it means sprinkling a few .clone()
here and there. It is possible to make cloning cheaper via reference counting if
need be, and this will save you from the trouble of attempting to learn all the
language subtleties of Rust at once. Pick your battles!
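For reference, here is a minimal sketch of what reference counting looks like with the standard library’s Rc type (use Arc instead for multi-threaded code):

use std::rc::Rc;

let big = Rc::new(vec![0.0f32; 1_000_000]);
// This "clone" only increments a reference count, the million floats inside
// are not copied
let alias = Rc::clone(&big);
assert_eq!(alias.len(), big.len());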
Exercise:
Modify the benchmark from the previous chapter so that the input gets generated
once outside of c.bench_function()
, then passed to sum()
by reference rather
than by value. Measure performance again, and see how much it helped.
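If you are unsure where to start, here is a rough sketch of the shape of the change, assuming a criterion benchmark like the one from the previous chapter (the sum() function and the input generation logic are placeholders that you should replace with your own):

// Hypothetical sketch: adapt names and sizes to your existing benchmark
let input = (0..1_000_000).map(|i| i as f32).collect::<Vec<_>>();
c.bench_function("sum_by_reference", |b| {
    // The input is generated once, outside of the benchmarked closure,
    // and only a reference is passed to the function under test
    b.iter(|| sum(&input))
});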
In Rust, this is the job of the
Drop
trait.
The more experienced reader will have noticed that although this rule of thumb works well most of the time, it has some corner cases. memcpy() itself is cheap but not free, and copying large amounts of bytes can easily become more expensive than calling the explicit copy operator of some types like reference-counted smart pointers. At the time of writing, there is an ongoing discussion towards addressing this by revisiting the Copy/Clone dichotomy in future evolutions of Rust.
Due to annoying practicalities like reference counting and mutexes in multi-threading, some amount of shared mutability has to exist in Rust. However, the vast majority of the types that you will be manipulating on a daily basis either internally follow the standard shared XOR mutable rule, or expose an API that does. Excessive use of shared mutability by Rust programs is frowned upon as unidiomatic.
There is actually one last easy case involving object methods that return references to the &self/&mut self parameter, but we will not have time to get into it during this short course.
SIMD
Single Instruction Multiple Data, or SIMD, is a very powerful hardware feature which lets you manipulate multiple numbers with a single CPU instruction. This means that if you play your cards right, code that uses SIMD can be from 2x to 64x faster than code that doesn’t, depending on what hardware you are running on and what data type you are manipulating.
Unfortunately, SIMD is also a major pain in the bottom as a programmer, because the set of operations that you can efficiently perform using SIMD instructions is extremely limited, and the performance of these operations is extremely sensitive to hardware details that you are not used to caring about, such as data alignment, contiguity and layout.
Why won’t the compiler do it?
People new to software performance optimization are often upset when they learn that SIMD is something that they need to take care of. They are used to optimizing compilers acting as wonderful and almighty hardware abstraction layers that usually generate near-optimal code with very little programmer effort, and rightfully wonder why, when it comes to SIMD, the abstraction layer fails them and they need to take matters into their own hands.
Part of the answer lies in this chapter’s introductory sentences. SIMD instruction sets are an absolute pain for compilers to generate code for, because they are so limited and sensitive to detail that within the space of all possible generated codes, the code that will actually work and run fast is basically a tiny corner case that cannot be reached through a smooth, progressive optimization path. This means that autovectorization code is usually the hackiest, most special-case-driven part of an optimizing compiler codebase. And you wouldn’t expect such code to work reliably.
But there is another side to this story, which is the code that you wrote. Compilers forbid themselves from performing certain optimizations when it is felt that they would make the generated code too unrelated to the original code, and thus impossible to reason about. For example, reordering the elements of an array in memory is generally considered to be off-limits, and so is reassociating floating-point operations, because this changes where floating-point rounding approximations are performed. In sufficiently tricky code, shifting roundings around can make the difference between producing a fine, reasonably accurate numerical result on one side, and accidentally creating a pseudorandom floating-point number generator on the other side.
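If you want to convince yourself that reassociation really changes results, here is a tiny sketch you can run (the exact values are just an illustration):

let (a, b, c) = (1.0e8f32, -1.0e8f32, 1.0f32);
let left_to_right = (a + b) + c; // == 1.0
let reassociated = a + (b + c);  // == 0.0, because b + c rounds back to -1.0e8
assert_ne!(left_to_right, reassociated);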
When it comes to summing arrays of floating point numbers, the second factor actually dominates. You asked the Rust compiler to do this:
fn sum_my_floats(v: &Vec<f32>) -> f32 {
    v.into_iter().sum()
}
…and the Rust compiler magically translated your code into something like this:
fn sum_my_floats(v: &Vec<f32>) -> f32 {
    let mut acc = 0.0;
    for x in v.into_iter() {
        acc += x;
    }
    acc
}
But that loop itself is actually not anywhere close to an optimal SIMD computation, which would be conceptually closer to this:
// This is a hardware-dependent constant, here picked for x86's AVX
const HARDWARE_SIMD_WIDTH: usize = 8;

fn simd_sum_my_floats(v: &Vec<f32>) -> f32 {
    let mut accs = [0.0; HARDWARE_SIMD_WIDTH];
    let chunks = v.chunks_exact(HARDWARE_SIMD_WIDTH);
    let remainder = chunks.remainder();
    // We are not doing instruction-level parallelism for now. See next chapter.
    for chunk in chunks {
        // This loop will compile into a single fast SIMD addition instruction
        for (acc, element) in accs.iter_mut().zip(chunk) {
            *acc += *element;
        }
    }
    for (acc, element) in accs.iter_mut().zip(remainder) {
        *acc += *element;
    }
    // This would need to be tuned for optimal efficiency at small input sizes
    accs.into_iter().sum()
}
…and there is simply no way to go from sum_my_floats
to simd_sum_my_floats
without reassociating floating-point operations. Which is not a nice thing to do
behind the original code author’s back, for reasons that numerical computing
god William Kahan will explain much
better than I can in his many papers and presentations.
All this to say: yes, there is unfortunately no compiler optimizer free lunch with SIMD reductions, and you will need to help the compiler a bit in order to get there…
A glimpse into the future: portable_simd
Unfortunately, the influence of Satan on the design of SIMD instruction sets does not end at the hardware level. The hardware vendor-advertised APIs for using SIMD hold the dubious distinction of feeling even worse to use. But thankfully, there is some hope on the Rust horizon.
The GCC and clang compilers have long provided a SIMD hardware abstraction layer, which provides an easy way to access the common set of SIMD operations that all hardware vendors agree is worth having (or at least is efficient enough to emulate on hardware that disagrees). As is too common when compiler authors design user interfaces, the associated C API is not exactly intuitive, and is therefore rarely used by C/++ practitioners. But a small, well-motivated team has been hard at work during the past few years, building a nice high-level Rust API on top of this low-level compiler backend functionality.
This project is currently integrated into nightly versions of the Rust compiler
as the std::simd
experimental
standard library module. It is an important project because if it succeeds at
being integrated into stable Rust on a reasonable time frame, it might actually
be the first time a mainstream programming language provides a standardized1
API for writing SIMD code, that works well enough for common use cases without
asking programmers to write one separate code path for each supported hardware
instruction set.
This becomes important when you realize that x86 released more than 35 extensions2 to its SIMD instruction set in around 40 years of history, while Arm has been maintaining two families of incompatible SIMD instruction set extensions with completely different API logic3 across their platform for a few years now and will probably need to keep doing so for many decades to come. And that’s just the two main CPU vendors as of 2024, not accounting for the wider zoo of obscure embedded CPU architectures that application domains like space programs need to cope with. Unless you are Intel or Arm and have thousands of people-hours to spend on maintaining dozens of optimized backends for your mathematical libraries, achieving portable SIMD performance through hand-tuned hardware-specific code paths is simply not feasible anymore in the 21st century.
In contrast, using std::simd
and the multiversion
crate4, a reasonably
efficient SIMD-enabled floating point number summing function can be written
like this…
#![feature(portable_simd)] // Release the nightly compiler kraken
use multiversion::{multiversion, target::selected_target};
use std::simd::prelude::*;
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
fn simd_sum(x: &Vec<f32>) -> f32 {
// This code uses a few advanced language features that we do not have time
// to cover during this short course. But feel free to ask the teacher about
// it, or just copy-paste it around.
const SIMD_WIDTH: usize = const {
if let Some(width) = selected_target!().suggested_simd_width::<f32>() {
width
} else {
1
}
};
let (peel, body, tail) = x.as_simd::<SIMD_WIDTH>();
let simd_sum = body.into_iter().sum::<Simd<f32, SIMD_WIDTH>>();
let scalar_sum = peel.into_iter().chain(tail).sum::<f32>();
simd_sum.reduce_sum() + scalar_sum
}
…which, as a famous TV show would put it, may not look great, but is definitely not terrible.
Back from the future: slipstream & safe_arch
The idea of using experimental features from a nightly version of the Rust compiler may send shivers down your spine, and that’s understandable. Having your code occasionally fail to build because the language standard library just changed under your feet is really not for everyone.
If you need to target current stable versions of the Rust compilers, the main
alternatives to std::simd
that I would advise using are…
- slipstream, which tries to do the same thing as std::simd, but using autovectorization instead of relying on direct compiler SIMD support. It usually generates worse SIMD code than std::simd, which says a lot about autovectorization, but for simple things, a slowdown of “only” ~2-3x with respect to peak hardware performance is achievable.
- safe_arch, which is x86-specific, but provides a very respectable attempt at making the Intel SIMD intrinsics usable by people who were not introduced to programming through the afterschool computing seminars of the cult of Yog-Sothoth.
But as you can see, you lose quite a bit by settling for these, which is why
Rust would really benefit from getting std::simd
stabilized sooner rather than
later. If you know of a SIMD expert who could help at this task, please consider
attempting to nerd-snipe her into doing so!
Exercise
The practical work environment has already been configured to use a nightly release of the Rust compiler, which is version-pinned for reproducibility.
Integrate simd_sum
into the benchmark and compare it to your previous
optimized version. As often with SIMD, you should expect worse performance on
very small inputs, followed by much improved performance on larger inputs, which
will degrade back into lower performance as the input size gets so large that you start thrashing the fastest CPU caches and hitting slower memory tiers.
Notice that the #![feature(portable_simd)]
experimental feature enablement
directive must be at the top of the benchmark source file, before any other
program declaration. The other paragraphs of code can be copied anywhere you
like, including after the point where they are used.
If you would fancy a more challenging exercise, try implementing a dot product using the same logic.
Standardization matters here. Library-based SIMD abstraction layers have been around for a long while, but since SIMD hates abstraction and inevitably ends up leaking through APIs sooner rather than later, it is important to have common language vocabulary types so that everyone is speaking the same SIMD language. Also, SIMD libraries have an unfortunate tendency to only correctly cover the latest instruction sets from the most popular hardware vendors, leaving other hardware manufacturers out in the cold and thus encouraging hardware vendor lock-in that this world doesn’t need.
MMX, 3DNow!, SSE, SSE2, SSE3, SSSE3 (not a typo!), SSE4, AVX, F16C, XOP, FMA4 and FMA3, AVX2, the 19 different subsets of AVX-512, AMX, and most recently at the time of writing, AVX10.1 and the upcoming AVX10.2. Not counting more specialized ISA extensions that would also arguably belong to this list, like BMI and the various cryptography primitives that are commonly (ab)used by the implementation of fast PRNGs.
Which handles the not-so-trivial matter of having your code adapt at runtime to the hardware that you have, without needing to roll out one build per cursed revision of the x86/Arm SIMD instruction set.
ILP
Did you know that a modern CPU core can do multiple things in parallel? And by that, I am not just referring to running different threads on different CPU cores, or processing multiple data elements in a single CPU instruction. Each CPU core from a multicore CPU can be executing multiple instructions (possibly SIMD) at the same time, through the hardware magic of superscalar execution.
In what is becoming by now a recurring theme, however, this extra processing power does not come for free. The assembly that your code compiles into must actually feature multiple independent streams of instructions in order to feed all these superscalar execution units. If each instruction from your code depends on the previous one, as was the case in our first SIMD sum implementation, then no superscalar execution will happen, resulting in inefficient CPU hardware use.
In other words, to run optimally fast, your code must feature a form of fine-grained concurrency called Instruction-Level Parallelism, or ILP for short.
The old and cursed way
Back in the days where dinosaurs roamed the Earth and Fortran 77 was a cool new language that made the Cobol-74 folks jealous, the expert-sanctioned way to add N-way instruction-level parallelism to a performance-sensitive computation would have been to manually unroll your loops and write N copies of your computations inside of them:
fn sum_ilp3(v: &Vec<f32>) -> f32 {
    let mut accs = [0.0, 0.0, 0.0];
    let num_chunks = v.len() / 3;
    for i in 0..num_chunks {
        // TODO: Add SIMD here
        accs[0] += v[3 * i];
        accs[1] += v[3 * i + 1];
        accs[2] += v[3 * i + 2];
    }
    let first_irregular_idx = 3 * num_chunks;
    for i in first_irregular_idx..v.len() {
        accs[i - first_irregular_idx] += v[i];
    }
    accs[0] + accs[1] + accs[2]
}

let v = vec![1.2, 3.4, 5.6, 7.8, 9.0];
assert_eq!(sum_ilp3(&v), v.into_iter().sum::<f32>());
Needless to say, this way of doing things does not scale well to more complex computations or high degrees of instruction-level parallelism. It can also easily make code a lot harder to maintain, since one must remember to do each modification to the ILP’d code in N different places. Also, I hope for you that you will rarely if ever need to change the N tuning parameter in order to fit, say, a new CPU architecture with different quantitative parameters.
Thankfully, we are now living in a golden age of computing where high-end fridges have more computational power than the supercomputers of the days where this advice was relevant. Compilers have opted to use some of this abundant computing power to optimize programs better, and programming languages have built on top of these optimizations to provide new features that give library writers a lot more expressive power at little to no runtime performance cost. As a result, we can now have ILP without sacrificing the maintainability of our code like we did above.
The iterator_ilp way
First of all, a mandatory disclaimer: I am the maintainer of the iterator_ilp
library. It started as an experiment to see if the advanced capabilities of
modern Rust could be leveraged to make the cruft of copy-and-paste ILP obsolete.
Since the experiment went well enough for my needs, I am now sharing it with
you, in the hope that you will also find it useful.
The whole raison d’être of iterator_ilp
is to take the code that I showed
you above, and make the bold but proven claim that the following code compiles
down to faster1 machine code:
use iterator_ilp::IteratorILP;
fn sum_ilp3(v: &Vec<f32>) -> f32 {
v.into_iter()
// Needed until Rust gets stable generics specialization
.copied()
.sum_ilp::<3, f32>()
}
let v = vec![1.2, 3.4, 5.6, 7.8, 9.0];
assert_eq!(sum_ilp3(&v), v.into_iter().sum::<f32>());
Notice how I am able to add new methods to Rust iterators. This leverages a
powerful property of Rust traits, which is that they can be implemented for
third-party types. The requirement for using such an extension trait, as they
are sometimes called, is that the trait that adds new methods must be explicitly
brought in scope using a use
statement, as in the code above.
It’s not just that I have manually implemented a special case for floating-point
sums, however. My end goal with this library is that any iterator you can
fold()
, I should ultimately be able to fold_ilp()
into the moral
equivalent of the ugly hand-unrolled loop that you’ve seen above, with only
minimal code changes required on your side. So for example, this should
ultimately be as efficient as hand-optimized ILP:2
fn norm_sqr_ilp9(v: &Vec<f32>) -> f32 {
v.into_iter()
.copied()
.fold_ilp::<9, _>(
|| 0.0,
|acc, elem| elem.mul_add(elem, acc),
|acc1, acc2| acc1 + acc2
)
}
Exercise
Use iterator_ilp
to add instruction-level parallelism to your SIMD sum, and
benchmark how close doing so gets you to the peak hardware performance of a
single CPU core.
Due to std::simd
not being stable yet, it is unfortunately not yet fully
integrated in the broader Rust numerics ecosystem, so you will not be able to
use sum_ilp()
and will need the following more verbose fold_ilp()
alternative:
use iterator_ilp::IteratorILP;
let result =
array_of_simd
.iter()
.copied()
.fold_ilp::<2, _>(
// Initialize accumulation with a SIMD vector of zeroes
|| Simd::splat(0.0),
// Accumulate one SIMD vector into the accumulator
|acc, elem| ...,
// Merge two SIMD accumulators at the end
|acc1, acc2| ...,
);
Once the code works, you will have to tune the degree of instruction-level parallelism carefully:
- Too little and you will not be able to leverage all of the CPU’s superscalar hardware.
- Too much and you will pass the limit of the CPU’s register file, which will
lead to CPU registers spilling to the stack at a great performance cost.
- Also, beyond a certain degree of specified ILP, the compiler optimizer will often just give up and generate a scalar inner loop, because it cannot prove that trying harder to optimize would eventually lead to simpler and faster code.
For what it’s worth, compiler autovectorizers have gotten good enough that you
can actually get the compiler to generate both SIMD instructions and ILP using
nothing but iterator_ilp
with huge instruction-level parallelism. However,
reductions do not autovectorize that well for various reasons, so the
performance will be worse. Feel free to benchmark how much you lose by using
this strategy!
It’s mainly faster due to the runtime costs of manual indexing in Rust. I could rewrite the above code with only iterators to eliminate this particular overhead, but hopefully you will agree with me that it would make the code even cruftier than it already is.
Sadly, we’re not there yet today. You saw the iterator-of-reference issue
above, and there are also still some issues around iterators of tuples
from zip()
. But I know how to resolve these issues on the implementation
side, and once Rust gets generics specialization, I should be able to
automatically resolve them for you without asking you to call the API
differently.
Parallelism
So far, we have been focusing on using a single CPU core efficiently, honoring some ancient words of software performance optimization wisdom:
You can have a second computer, once you know how to use the first one.
But as of this chapter, we have finally reached the point where our floating point sum makes good use of the single CPU core that it is running on. Therefore, it’s now time to put all those other CPU cores that have been sitting idle so far to good use.
Easy parallelism with rayon
The Rust standard library only provides low-level parallelism primitives like
threads and mutexes. However, limiting yourself to these would be unwise, as the
third-party Rust library ecosystem is full of multithreading gems. One of them
is the rayon
crate, which provides equivalents of standard Rust iterators that
automatically distribute your computation over multiple threads of execution.
Getting started with rayon
is very easy. First you add it as a dependency to
your project…
cargo add rayon
…and then you pick the computation you want to parallelize, and replace standard Rust iteration with the rayon-provided parallel iteration methods:
use rayon::prelude::*;
fn par_sum(v: &Vec<f32>) -> f32 {
// TODO: Add back SIMD, ILP, etc
v.into_par_iter().sum()
}
That’s it, your computation is now running in parallel. By default, it will use
one thread per available CPU hyperthread. You can easily tune this using the
RAYON_NUM_THREADS
environment variable. And for more advanced use cases like
comparative benchmarking at various numbers of threads, it is also possible to
configure the thread pool from your
code.
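For example, here is a minimal sketch of setting up the global thread pool from code with rayon's ThreadPoolBuilder (error handling kept deliberately simple):

// This must run before any other rayon code uses the global thread pool
rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build_global()
    .expect("the global thread pool should only be initialized once");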
The power of ownership
Other programming languages also provide easy-looking ways to parallelize computations, like OpenMP in C/++ and Fortran. But if you have tried them, your experience was probably not great. It is likely that you have run into all kinds of wrong results, segfaults, and other nastiness.
This happens much more rarely in Rust, because Rust’s ownership and borrowing model has been designed from the start to make multi-threading easier. As a reminder, in Rust, you can only have one code path writing to a variable at any given time, or N code paths reading from it. So by construction, data races, where multiple threads access the same data concurrently and at least one of them is writing, cannot happen in safe Rust. This is great because data races are one of the worst kinds of parallelism bug. They result in undefined behavior, and therefore give the compiler and hardware license to trash your program in unpredictable ways, which they will gladly do.
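As a small sketch of what this means in practice, the following code does not compile, because both spawned threads would need mutable access to the same variable at the same time:

let mut total = 0.0f32;
std::thread::scope(|s| {
    s.spawn(|| total += 1.0);
    s.spawn(|| total += 2.0); // ERROR: total is already mutably borrowed
                              //        by the first spawned closure
});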
That being said, it is easy to misunderstand the guarantees that Rust gives you and get a false sense of security from this, which will come back to bite you later on. So let’s make things straight right now: Rust does not protect against all parallelism bugs. Deadlocks, and race conditions other than data races where operations are performed in the “wrong” order, can still happen. It is sadly not the case that just because you are using Rust, you can forget about decades of research in multi-threaded application architecture and safe multi-threading patterns. Getting there will hopefully be the job of the next generation of programming language research.
What Rust does give you, however, is a language-enforced protection against the worst kinds of multi-threading bugs, and a vibrant ecosystem of libraries that make it trivial for you to apply solid multi-threading architectural patterns in your application in order to protect yourself from the other bugs. And all this power is available for you today, not hopefully tomorrow if the research goes well.
Optimizing the Rayon configuration
Now, if you have benchmarked the above parallel computation, you may have been disappointed with the runtime performance, especially at low problem size. One drawback of libraries like Rayon that implement a very general parallelism model is that they can only be performance-tuned for a specific kind of code, which is not necessarily the kind of code that you are writing right now. So for specific computations, fine-tuning may be needed to get optimal results.
In our case, one problem that we have is that Rayon automatically slices our workload into arbitrarily small chunks, down to a single floating-point number, in order to keep all of its CPU threads busy. This is appropriate when each computation from the input iterator is relatively complex, like processing a full data file, but not for simple computations like floating point sums. At this scale, the overhead of distributing work and waiting for it to complete gets much higher than the performance gain brought by parallelizing.
We can avoid this issue by giving Rayon a minimal granularity below
which work should be processed sequentially, using the par_chunks
method:
use rayon::prelude::*;
fn par_sum(v: &Vec<f32>) -> f32 {
// TODO: This parameter must be tuned empirically for your particular
// computation, on your target hardware.
v.par_chunks(1024)
.map(seq_sum)
.sum()
}
// This function will operate sequentially on slices whose length is dictated by
// the tuning parameter given to par_chunks() above.
fn seq_sum(s: &[f32]) -> f32 {
// TODO: Replace with optimized sequential sum with SIMD and ILP.
s.into_iter().sum()
}
Notice that the par_chunks method produces a parallel iterator of slices, not
Vec
s. Slices are simpler objects than Vec
s, so every Vec
can be
reinterpreted as a slice, but not every slice can be reinterpreted as a Vec
.
This is why the idiomatic style for writing numerical code in Rust is actually
to accept slices as input, not &Vec
. I have only written code that takes
&Vec
in previous examples to make your learning process easier.
With this change, we can reach the crossover point where the parallel computation becomes faster than the sequential one at a smaller input size. But spawning and joining a parallel job has a fixed cost, and that cost does not go down when you increase the sequential granularity. So a computation that is efficient at all input sizes would be closer to this:
fn sum(v: &Vec<f32>) -> f32 {
// TODO: Again, this will require workload- and hardware-specific tuning
if v.len() > 4096 {
par_sum(v)
} else {
seq_sum(v.as_slice())
}
}
Of course, whether you need these optimizations depends on how much you are interested in performance at small input sizes. If the datasets that you are
processing are always huge, the default rayon
parallelization logic should
provide you with near-optimal performance without extra tuning. Which is
likely why rayon
does not perform these optimizations for you by default.
Exercise
Parallelize one of the computations that you have optimized previously using
rayon
. You will not need to run cargo add rayon
in the exercises project,
because for HPC network policy reasons, the dependency needed to be added and
downloaded for you ahead of time.
Try benchmarking the computation at various numbers of threads, and see how it
affects performance. As a reminder, you can tune the number of threads using the
RAYON_NUM_THREADS
environment variable. The hardware limit above which you can
expect no benefit from extra threads is the number of system CPUs, which can be
queried using nproc
. But as you will see, sometimes fewer threads can be better.
If you choose to implement the par_chunks()
and fully sequential fallback
optimizations, do not forget to adjust the associated tuning parameters for
optimal performance at all input sizes.
ndarray
At this point, you should know enough about numerical computing in Rust to efficiently implement any classic one-dimensional computation, from FFT to BLAS level 1 vector operations.
Sadly, this is not enough, because the real world insists on solving multi-dimensional mathematical problems1. And therefore you will need to learn what it takes to make these efficient in Rust too.
The first step to get there is to have a way to actually get multi-dimensional arrays into your code. Which will be the topic of this chapter.
Motivation
Multi-dimensional array libraries are actually not a strict requirement for numerical computing. If programming like a BASIC programmer from the 80s is your kink, you can just treat any 1D array as a multidimensional array by using the venerable index linearization trick:
let array = [00, 01, 02, 03, 04,
             10, 11, 12, 13, 14,
             20, 21, 22, 23, 24,
             30, 31, 32, 33, 34];
let row_length = 5;
let index_2d = [2, 1];
let index_linear = row_length * index_2d[0] + index_2d[1];
assert_eq!(array[index_linear], 21);
There are only two major problems with this way of doing things:
- Writing non-toy code using this technique is highly error-prone. You will spend many hours debugging why a wrong result is returned, only to find out much later that you have swapped the row and column index, or did not keep the row length metadata in sync with changes to the contents of the underlying array.
- It is nearly impossible to access manually implemented multidimensional arrays
without relying on lots of manual array/
Vec
indexing. The correctness and runtime performance issues of these indexing operations in Rust should be familiar to you by now. So if we can, it would be great to get rid of them in our code.
The ndarray
crate resolves these problems by implementing all the tricky index
linearization code for you, and exposing it using types that act like a
multidimensional version of Rust’s Vec
s and slices. These types keep data and
shape metadata in sync for you, and come with multidimensional versions of all
the utility methods that you are used to coming from standard library Vec
and
slices, including multidimensional overlapping window iterators.
ndarray
is definitely not the only library doing this in the Rust ecosystem,
but it is one of the most popular ones. This means that you will find plenty of
other libraries that build on top of it, like linear algebra and statistics
libraries. Also, of all available options, it is in my opinion currently the one
that provides the best tradeoff between ease of use, runtime performance,
generality and expressive power. This makes it an excellent choice for
computations that go a bit outside of the linear algebra textbook vocabulary,
including the Gray-Scott reaction simulation that we will implement next.
Adding ndarray
to your project
As usual, adding a new dependency is as easy as cargo add
:
cargo add ndarray
But this time, you may want to look into the list of optional features that gets displayed in your console when you run this command:
$ cargo add ndarray
Updating crates.io index
Adding ndarray v0.15.6 to dependencies
Features:
+ std
- approx
- approx-0_5
- blas
- cblas-sys
- docs
- libc
- matrixmultiply-threading
- rayon
- rayon_
- serde
- serde-1
- test
By default, ndarray
keeps the set of enabled features small in order to speed
up compilation. But if you want to do more advanced numerics with ndarray
in
the future, some of the optional functionality, like integration with the system
BLAS, or parallel Rayon iterators over the contents of multidimensional arrays
may come in very handy. We will, however, not need them for this particular course,
so let’s move on.
Creating Arrays
The heart of ndarray
is
ArrayBase
, a
very generic multidimensional array type that can either own its data (like
Vec
) or borrow it from a Rust slice that you obtained from some other source.
While you are learning Rust, you will likely find it easier to avoid using this
very general-purpose generic type directly, and instead work with the various
type aliases that are provided by the ndarray
crate for easier operation. For
our purposes, three especially useful aliases will be…
- Array2, which represents an owned two-dimensional array of data (similar to Vec<T>)
- ArrayView2, which represents a shared 2D slice of data (similar to &[T])
- ArrayViewMut2, which represents a mutable 2D slice of data (similar to &mut [T])
But because these are just type aliases, the documentation page for them will
not tell you about all available methods. So you will still want to keep the
ArrayBase
documentation
close by.
Another useful tool in ndarray is the array!
macro, which lets you create
Array
s using a syntax analogous to that of the vec![]
macro:
use ndarray::array;
let a2 = array![[01, 02, 03],
[11, 12, 13]];
You can also create Array
s from a function that maps each index to a
corresponding data element…
use ndarray::{Array, arr2};
// Create a table of i × j (with i and j from 1 to 3)
let ij_table = Array::from_shape_fn((3, 3), |(i, j)| (1 + i) * (1 + j));
assert_eq!(
ij_table,
// You can also create an array from a slice of Rust arrays
arr2(&[[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
);
…and if you have a Vec
of data around, you can turn it into an Array
as
well.
But because a Vec
does not come with multidimensional shape information, you
will need to provide this information separately, at the risk of it going out of
sync with the source data. And you will also need to pay close attention to the order in which Array
elements should be provided by the input Vec
(ndarray
uses row-major order by default).
So all in all, in order to avoid bugs, it is best to avoid these conversions and stick with the ndarray APIs whenever you can.
Iteration
Rust iterators were designed for one-dimensional data, and using them for multidimensional data comes at the cost of losing useful information. For example, they cannot express concepts like “here is a block of data that is contiguous in memory, but then there is a gap” or “I would like the iterator to skip to the next row of my matrix here, and then resume iteration”.
For this reason, ndarray
does not directly use standard Rust iterators.
Instead, it uses a homegrown abstraction called
NdProducer
,
which can be lossily converted to a standard iterator.
We will not be leveraging the specifics of ndarray
producers much in this
course, as standard iterators are enough for what we are doing. But I am telling
you about this because it explains why iterating over Arrays
may involve a
different API than iterating over a standard Rust collection, or require an
extra into_iter()
producer-to-iterator conversion step.
In simple cases, however, the conversion will just be done automatically. For
example, here is how one would iterate over 3x3 overlapping windows of an
Array2
using ndarray
:
use ndarray::{array, azip};
let arr = array![[01, 02, 03, 04, 05, 06, 07, 08, 09],
[11, 12, 13, 14, 15, 16, 17, 18, 19],
[21, 22, 23, 24, 25, 26, 27, 28, 29],
[31, 32, 33, 34, 35, 36, 37, 38, 39],
[41, 42, 43, 44, 45, 46, 47, 48, 49]];
for win in arr.windows([3, 3]) {
println!("{win:?}");
}
ndarray
comes with a large set of producers, some of which are more
specialized and optimized than others. It is therefore a good idea to spend some
time looking through the various options that can be used to solve a particular
problem, rather than picking the first producer that works.
Indexing and slicing
Like Vec
s and slices, Array
s and ArrayView
s support indexing and slicing.
And as with Vec
s and slices, it is generally a good idea to avoid using these
operations for performance and correctness reasons, especially inside of
performance-critical loops.
Indexing works using square brackets as usual, the only new thing being that you can pass in an array or a tuple of indices instead of just one index:
use ndarray::Array2;
let mut array = Array2::zeros((4, 3));
array[[1, 1]] = 7;
However, a design oversight in the Rust indexing operator means that it cannot be used for slicing multidimensional arrays. Instead, you will need the following somewhat unclean syntax:
use ndarray::s;

// Assuming here that "a" is a three-dimensional array or array view
let b = a.slice(s![.., 0..1, ..]);
Notice the use of an s![]
macro for constructing a slicing configuration
object, which is then passed to a slice()
method of the generic ArrayView
object.
This is just one of many slicing methods. Among others, we also get…
- slice(), which creates ArrayViews (analogous to &vec[start..finish])
- slice_mut(), which creates ArrayViewMuts (analogous to &mut vec[start..finish])
- multi_slice_mut(), which creates several non-overlapping ArrayViewMuts in a single transaction. This can be used to work around Rust’s “single mutable borrow” rule when it proves to be an unnecessary annoyance.
- slice_move(), which consumes the input array or view and returns an owned slice.
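To make slice_mut() and multi_slice_mut() a little more concrete, here is a minimal sketch (the shapes and values are arbitrary, purely for illustration):

use ndarray::{s, Array2};

let mut a = Array2::<f32>::zeros((4, 4));
// Mutable view of the top half of the matrix
a.slice_mut(s![..2, ..]).fill(1.0);
// Two non-overlapping mutable views obtained in a single call
let (mut left, mut right) = a.multi_slice_mut((s![.., ..2], s![.., 2..]));
left += 1.0;
right += 2.0;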
Exercise
All ndarray
types natively support a sum()
operation. Compare its
performance to that of your optimized floating-point sum implementation over a
wide range of input sizes.
One thing which will make your life easier is that
Array1
, the owned
one-dimensional array type, can be built from a standard iterator using the
Array::from_iter()
constructor.
…which are a poor fit for the one-dimensional memory architecture of standard computers, and this causes a nearly infinite amount of fun problems that we will not all cover during this course.
Gray-Scott introduction
We are now ready to introduce the final boss computation of this course: the
Gray-Scott reaction simulation. In this chapter, you will be taken through a
rapid tour of the pre-made setup that is provided to you for the purpose of
input initialization, kernel benchmarking, and output HDF5 production. We will
then conclude this tour by showing how one would implement a simple, unoptimized
version of the simulation using ndarray
.
Along the way, you will also get a quick glimpse of how Rust’s structs and methods work. We did not get to explore this area of the language due to the course’s short format, but in a nutshell, they can be used to implement encapsulated objects like C++ classes, and that is what we do here.
Input initialization
In the Gray-Scott school that this course originates from, the reference C++ implementation of the Gray-Scott simulation would hardcode the initial input state of the simulation. For the sake of keeping things simple and comparable, we will do the same:
use ndarray::Array2;
/// Computation precision
///
/// It is a good idea to make this easy to change in your programs, especially
/// if you care about hardware portability or have ill-specified output
/// precision requirements.
///
/// Notice the "pub" keyword for exposing this to the outside world. All types,
/// functions... are private to the current code module by default in Rust.
pub type Float = f32;
/// Storage for the concentrations of the U and V chemical species
pub struct UV {
pub u: Array2<Float>,
pub v: Array2<Float>,
}
//
/// impl blocks like this let us add methods to types in Rust
impl UV {
/// Set up the hardcoded chemical species concentration
///
/// Notice the `Self` syntax which allows you to refer to the type for which
/// the method is implemented.
fn new(num_rows: usize, num_cols: usize) -> Self {
let shape = [num_rows, num_cols];
let pattern = |row, col| {
(row >= (7 * num_rows / 16).saturating_sub(4)
&& row < (8 * num_rows / 16).saturating_sub(4)
&& col >= 7 * num_cols / 16
&& col < 8 * num_cols / 16) as u8 as Float
};
let u = Array2::from_shape_fn(shape, |(row, col)| 1.0 - pattern(row, col));
let v = Array2::from_shape_fn(shape, |(row, col)| pattern(row, col));
Self { u, v }
}
/// Set up an all-zeroes chemical species concentration
///
/// This can be faster than `new()`, especially on operating systems like
/// Linux where all allocated memory is guaranteed to be initially zeroed
/// out for security reasons.
fn zeroes(num_rows: usize, num_cols: usize) -> Self {
let shape = [num_rows, num_cols];
let u = Array2::zeros(shape);
let v = Array2::zeros(shape);
Self { u, v }
}
/// Get the number of rows and columns of the simulation domain
///
/// Notice the `&self` syntax for borrowing the object on which the method
/// is being called by shared reference.
pub fn shape(&self) -> [usize; 2] {
let shape = self.u.shape();
[shape[0], shape[1]]
}
}
Double buffering
The Gray-Scott reaction is simulated by updating the concentrations of the U and V chemical species many times. This is done by reading the old concentrations of the chemical species from one array, and writing the new concentrations to another array.
We could be creating a new array of concentrations every time we do this, but this would require performing one memory allocation per simulation step, which can be expensive. Instead, it is more efficient to use the double buffering pattern. In this pattern, we keep two versions of the concentration in an array, and on every step of the simulation, we read from one of the array slots and write to the other array slot. Then we flip the role of the array slots for the next simulation step.
We can translate this pattern into a simple encapsulated object…
/// Double-buffered chemical species concentration storage
pub struct Concentrations {
buffers: [UV; 2],
src_is_1: bool,
}
//
impl Concentrations {
/// Set up the simulation state
pub fn new(num_rows: usize, num_cols: usize) -> Self {
Self {
buffers: [UV::new(num_rows, num_cols), UV::zeroes(num_rows, num_cols)],
src_is_1: false,
}
}
/// Get the number of rows and columns of the simulation domain
pub fn shape(&self) -> [usize; 2] {
self.buffers[0].shape()
}
/// Read out the current species concentrations
pub fn current(&self) -> &UV {
&self.buffers[self.src_is_1 as usize]
}
/// Run a simulation step
///
/// The user callback function `step` will be called with two inputs UVs:
/// one containing the initial species concentration at the start of the
/// simulation step, and one to receive the final species concentration that
/// the simulation step is in charge of generating.
///
/// Notice the `&mut self` syntax for borrowing the object on which
/// the method is being called by mutable reference.
pub fn update(&mut self, step: impl FnOnce(&UV, &mut UV)) {
let [ref mut uv_0, ref mut uv_1] = &mut self.buffers;
if self.src_is_1 {
step(uv_1, uv_0);
} else {
step(uv_0, uv_1);
}
self.src_is_1 = !self.src_is_1;
}
}
…and then this object can be used to run the simulation like this:
// Set up the concentrations buffer
let mut concentrations = Concentrations::new(num_rows, num_cols);
// ... other initialization work ...
// Main simulation loop
let mut running = true;
while running {
// Update the concentrations of the U and V chemical species
concentrations.update(|start, end| {
// TODO: Derive new "end" concentration from "start" concentration
end.u.assign(&start.u);
end.v.assign(&start.v);
});
// ... other per-step action, e.g. decide whether to keep running,
// write the concentrations to disk from time to time ...
running = false;
}
// Get the final concentrations at the end of the simulation
let result = concentrations.current();
println!("u is {:#?}", result.u);
println!("v is {:#?}", result.v);
HDF5 output
The reference C++ simulation lets you write down the concentration of the V chemical species to an HDF5 file every N computation steps. This can be used to check that the simulation works properly, or to turn the evolving concentration “pictures” into a video for visualization purposes.
Following its example, we will use the hdf5
Rust
crate1 to write data to HDF5 too, using the same file layout conventions for
interoperability. Here too, we will use an encapsulated object design to keep
things easy to use correctly:
use hdf5::{Dataset, File};

/// Mechanism to write down results to an HDF5 file
pub struct HDF5Writer {
/// HDF5 file handle
file: File,
/// HDF5 dataset
dataset: Dataset,
/// Number of images that were written so far
position: usize,
}
impl HDF5Writer {
/// Create or truncate the file
///
/// The file will be dimensioned to store a certain amount of V species
/// concentration arrays.
///
/// The `Result` return type indicates that this method can fail and the
/// associated I/O errors must be handled somehow.
pub fn create(file_name: &str, shape: [usize; 2], num_images: usize) -> hdf5::Result<Self> {
// The ? syntax lets us propagate errors from an inner function call to
// the caller, when we cannot handle them ourselves.
let file = File::create(file_name)?;
let [rows, cols] = shape;
let dataset = file
.new_dataset::<Float>()
.chunk([1, rows, cols])
.shape([num_images, rows, cols])
.create("matrix")?;
Ok(Self {
file,
dataset,
position: 0,
})
}
/// Write a new V species concentration table to the file
pub fn write(&mut self, result: &UV) -> hdf5::Result<()> {
self.dataset
.write_slice(&result.v, (self.position, .., ..))?;
self.position += 1;
Ok(())
}
/// Flush remaining data to the underlying storage medium and close the file
///
/// This should automatically happen on Drop, but doing it manually allows
/// you to catch and handle I/O errors properly.
pub fn close(self) -> hdf5::Result<()> {
self.file.close()
}
}
After adding this feature, our simulation code skeleton now looks like this:
// Set up the concentrations buffer
let mut concentrations = Concentrations::new(num_rows, num_cols);
// Set up HDF5 I/O
let mut hdf5 = HDF5Writer::create(file_name, concentrations.shape(), num_output_steps)?;
// Produce the requested amount of concentration tables
for _ in 0..num_output_steps {
// Run a number of simulation steps
for _ in 0..compute_steps_per_output_step {
// Update the concentrations of the U and V chemical species
concentrations.update(|start, end| {
// TODO: Derive new "end" concentration from "start" concentration
end.u.assign(&start.u);
end.v.assign(&start.v);
});
}
// Write down the current simulation output
hdf5.write(concentrations.current())?;
}
// Close the HDF5 file
hdf5.close()?;
Reusable simulation skeleton
Right now, our simulation’s update function is a stub that simply copies the input concentrations to the output concentrations without actually changing them. At some point, we are going to need to compute the real updated chemical species concentrations there.
However, we also know from our growing experience with software performance optimization that we are going to need to tweak this part of the code a lot. It would be great if we could do this in a laser-focused function that is decoupled from the rest of the code, so that we can easily do things like swapping computation backends and seeing what changes. As it turns out, a judiciously placed callback interface lets us do just this:
/// Simulation runner options
pub struct RunnerOptions {
/// Number of rows in the concentration table
num_rows: usize,
/// Number of columns in the concentration table
num_cols: usize,
/// Output file name
file_name: String,
/// Number of simulation steps to write to the output file
num_output_steps: usize,
/// Number of computation steps to run between each write
compute_steps_per_output_step: usize,
}
/// Simulation runner, with a user-specified concentration update function
pub fn run_simulation(
opts: &RunnerOptions,
// Notice that we must use FnMut here because the update function can be
// called multiple times, which FnOnce does not allow.
mut update: impl FnMut(&UV, &mut UV),
) -> hdf5::Result<()> {
// Set up the concentrations buffer
let mut concentrations = Concentrations::new(opts.num_rows, opts.num_cols);
// Set up HDF5 I/O
let mut hdf5 = HDF5Writer::create(
&opts.file_name,
concentrations.shape(),
opts.num_output_steps,
)?;
// Produce the requested amount of concentration tables
for _ in 0..opts.num_output_steps {
// Run a number of simulation steps
for _ in 0..opts.compute_steps_per_output_step {
// Update the concentrations of the U and V chemical species
concentrations.update(&mut update);
}
// Write down the current simulation output
hdf5.write(concentrations.current())?;
}
// Close the HDF5 file
hdf5.close()
}
Command-line options
Our simulation has a fair number of tuning parameters. To those that we have already
listed in RunnerOptions
, the computational chemistry of the Gray-Scott
reaction requires that we add the following tunable parameters:
- The speed at which V turns into P
- The speed at which U is added to the simulation and U, V and P are removed
- The amount of simulated time that passes between simulation steps
We could just hardcode all these parameters, but doing so would anger the gods of software engineering and break feature parity with the reference C++ version. So instead we will make these parameters configurable via command-line parameters whose syntax and semantics strictly match those of the C++ version.
To this end, we can use the excellent clap
library,
which provides the best API for parsing command line options that I have ever
seen in any programming language.
The first step, as usual, is to add clap
as a dependency to our project. We will
also enable the derive
optional feature, which is the key to the
aforementioned nice API:
cargo add --features=derive clap
We will then add some annotations to the definition of our options structs, explaining how they map to the command-line options that our program expects (which follow the syntax and defaults of the C++ reference version for interoperability):
use clap::Args;
/// Simulation runner options
#[derive(Debug, Args)]
pub struct RunnerOptions {
/// Number of rows in the concentration table
#[arg(short = 'r', long = "nbrow", default_value_t = 1080)]
pub num_rows: usize,
/// Number of columns in the concentration table
#[arg(short = 'c', long = "nbcol", default_value_t = 1920)]
pub num_cols: usize,
/// Output file name
#[arg(short = 'o', long = "output", default_value = "output.h5")]
pub file_name: String,
/// Number of simulation steps to write to the output file
#[arg(short = 'n', long = "nbimage", default_value_t = 1000)]
pub num_output_steps: usize,
/// Number of computation steps to run between each write
#[arg(short = 'e', long = "nbextrastep", default_value_t = 34)]
pub compute_steps_per_output_step: usize,
}
/// Simulation update options
#[derive(Debug, Args)]
pub struct UpdateOptions {
/// Speed at which U is added to the simulation and U, V and P are removed
#[arg(short, long, default_value_t = 0.014)]
pub feedrate: Float,
/// Speed at which V turns into P
#[arg(short, long, default_value_t = 0.054)]
pub killrate: Float,
/// Simulated time interval on each simulation step
#[arg(short = 't', long, default_value_t = 1.0)]
pub deltat: Float,
}
We then create a top-level struct which represents our full command-line interface…
use clap::Parser;
/// Gray-Scott reaction simulation
///
/// This program simulates the Gray-Scott reaction through a finite difference
/// schema that gets integrated via the Euler method.
#[derive(Debug, Parser)]
#[command(version)]
pub struct Options {
#[command(flatten)]
runner: RunnerOptions,
#[command(flatten)]
pub update: UpdateOptions,
}
…and in the main function of our final application, we call the automatically
generated parse()
method of that struct and retrieve the parsed command-line
options.
fn main() {
let options = Options::parse();
// ... now do something with "options" ...
}
That’s it. With no extra work, clap
will automatically provide our
simulation with a command-line interface that follows all standard Unix
conventions (e.g. supports both --option value
and --option=value
), handles
user errors, parses argument strings to their respective concrete Rust types,
and prints auto-generated help strings when -h
or --help
is passed.
Also, if you spend 10 more minutes on it, you can make as many of these options
as you want configurable via environment variables too. Which can be convenient
in scenarios where you cannot receive configuration through CLI parameters, like
inside of criterion
microbenchmarks.
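For the curious, here is a hedged sketch of what that could look like, assuming you enable clap's env cargo feature (the environment variable name below is made up for illustration):

use clap::Args;

/// Variant of RunnerOptions where the number of rows can also come from an
/// environment variable when the corresponding CLI option is not passed
#[derive(Debug, Args)]
pub struct RunnerOptions {
    /// Number of rows in the concentration table
    #[arg(short = 'r', long = "nbrow", env = "GRAYSCOTT_NBROW", default_value_t = 1080)]
    pub num_rows: usize,
    // ... other options as before ...
}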
Hardcoded parameters
Not all parameters of the C++ reference version are configurable. Some of them are hardcoded, and can only be changed by altering the source code. Since we are aiming for perfect user interface parity with the C++ version, we want to replicate this design in the Rust version.
For now, we will do this by adding a few constants with the hardcoded values to the source code:
type Float = f32;

/// Weights of the discrete convolution stencil
pub const STENCIL_WEIGHTS: [[Float; 3]; 3] = [
    [0.25, 0.5, 0.25],
    [0.5, 0.0, 0.5],
    [0.25, 0.5, 0.25]
];

/// Offset from the top-left corner of STENCIL_WEIGHTS to its center
pub const STENCIL_OFFSET: [usize; 2] = [1, 1];

/// Diffusion rate of the U species
pub const DIFFUSION_RATE_U: Float = 0.1;

/// Diffusion rate of the V species
pub const DIFFUSION_RATE_V: Float = 0.05;
In Rust, const
items let you declare compile-time constants, much like
constexpr
variables in C++, parameter
s in Fortran, and #define STUFF 123
in C. We do not have the time to dive into the associated language
infrastructure, but for now, all you need to know is that the value of a const
will be copy-pasted on each point of use, which ensures that the compiler
optimizer can specialize the code for the value of the parameter of interest.
Progress reporting
Simulations can take a long time to run. It is not nice to make users wait for
them to run to completion without any CLI output indicating how far along they
are and how much time remains until they are done. Especially when it is very
easy to add such reporting in Rust, thanks to the wonderful
indicatif
library.
To use it, we start by adding the library to our project’s dependencies…
cargo add indicatif
Then in our main function, we create a progress bar with a number of steps matching the number of computation steps…
use indicatif::ProgressBar;
let progress = ProgressBar::new(
(options.runner.num_output_steps
* options.runner.compute_steps_per_output_step) as u64,
);
…we increment it on each computation step…
progress.inc(1);
…and at the end of the simulation, we tell indicatif
that we are done2:
progress.finish();
That’s all we need to add basic progress reporting to our simulation.
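Putting the pieces together, the overall structure of the program's main function could look something like the following sketch, where the simulation update and HDF5 output steps are left as placeholders:
use clap::Parser;
use indicatif::ProgressBar;
fn main() {
    let options = Options::parse();
    // One progress bar tick per computation step
    let progress = ProgressBar::new(
        (options.runner.num_output_steps
            * options.runner.compute_steps_per_output_step) as u64,
    );
    for _ in 0..options.runner.num_output_steps {
        for _ in 0..options.runner.compute_steps_per_output_step {
            // ... run one simulation update step here ...
            progress.inc(1);
        }
        // ... write the current concentrations to the output file here ...
    }
    progress.finish();
}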
Final code layout
This is not a huge amount of code overall, but it does get uncomfortably large
and unfocused for a single code module. So in the exercises
Rust project, the
simulation code has been split over multiple code modules.
We do not have the time to cover the Rust module system in this course, but if you are interested, feel free to skim through the code to get a rough idea of how modularization is done, and ask any question that comes to your mind while doing so.
Microbenchmarks can only access code from the main library (below the src
directory of the project, excluding the bin/
subdirectory), therefore most of
the code lies there. In addition, we have added a simulation binary under
src/bin/simulate.rs
, and a microbenchmark under benches/simulate.rs
.
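For reference, the top of the library crate could look something like the following sketch. The options and data module names appear elsewhere in this course, but the exact split (and the io and simulation names in particular) is just one possible layout:
// exercises/src/lib.rs (hypothetical layout)
pub mod data;       // UV concentration storage and double buffering
pub mod io;         // HDF5 output
pub mod options;    // clap option structs and hardcoded constants
pub mod simulation; // update() and run_simulation()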
Exercise
Here is a naïve implementation of a Gray-Scott simulation step using
ndarray
:
use crate::options::{DIFFUSION_RATE_U, DIFFUSION_RATE_V, STENCIL_OFFSET, STENCIL_WEIGHTS};
/// Simulation update function
pub fn update(opts: &UpdateOptions, start: &UV, end: &mut UV) {
// Species concentration matrix shape
let shape = start.shape();
// Iterate over pixels of the species concentration matrices
ndarray::azip!(
(
index (out_row, out_col),
out_u in &mut end.u,
out_v in &mut end.v,
&u in &start.u,
&v in &start.v
) {
// Determine the stencil's input region
let out_pos = [out_row, out_col];
let stencil_start = array2(|i| out_pos[i].saturating_sub(STENCIL_OFFSET[i]));
let stencil_end = array2(|i| (out_pos[i] + STENCIL_OFFSET[i] + 1).min(shape[i]));
let stencil_range = array2(|i| stencil_start[i]..stencil_end[i]);
let stencil_slice = ndarray::s![stencil_range[0].clone(), stencil_range[1].clone()];
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (start.u.slice(stencil_slice).indexed_iter())
.zip(start.v.slice(stencil_slice))
.fold(
[0.; 2],
|[acc_u, acc_v], (((in_row, in_col), &stencil_u), &stencil_v)| {
let weight = STENCIL_WEIGHTS[in_row][in_col];
[acc_u + weight * (stencil_u - u), acc_v + weight * (stencil_v - v)]
},
);
// Deduce the change in U and V concentration
let uv_square = u * v * v;
let du = DIFFUSION_RATE_U * full_u - uv_square + opts.feedrate * (1.0 - u);
let dv = DIFFUSION_RATE_V * full_v + uv_square
- (opts.feedrate + opts.killrate) * v;
*out_u = u + du * opts.deltat;
*out_v = v + dv * opts.deltat;
}
);
}
/// Shorthand for creating a 2D Rust array from an index -> value mapping
fn array2<T>(f: impl FnMut(usize) -> T) -> [T; 2] {
std::array::from_fn(f)
}
Please integrate it into the codebase such that it can be used by both the
simulation binary at src/bin/simulate.rs
and the microbenchmark at
benches/simulate.rs
, without needing to constantly copy and paste code between
the two locations.
Then make sure everything works by running both of them using the following commands:
# Must use -- to separate cargo options from program options
cargo run --release --bin simulate -- -n 5 -e 2
cargo bench --bench simulate
It is expected that the last command will take a few minutes to complete. We are just at the start of our journey, and there’s a lot of optimization work to do. But the set of benchmark configurations is designed to remain relevant by the time the simulation is running much, much faster.
Also, starting with this chapter, the exercises are going to get significantly more complex. Therefore, it is a good idea to keep track of old versions of your work and have a way to get back to them. To do this, you can turn the exercises codebase into a git repository…
cd ~/exercises
git init
git add --all
git commit -m "Initial commit"
…then save a commit at the end of each chapter, or more generally whenever you feel like you have a codebase state that’s worth keeping around for later.
git add --all
git commit -m "<Describe your code changes here>"
More precisely, the
hdf5-metno
fork of that
hdf5
crate, because the author of the original crate sadly ceased
maintenance without handing off his crates.io push rights to someone
else…
This step is needed because indicatif
allows you to add more work to the
progress bar.
Regularizing
If you ran a profiler on the initial code that you were provided with, you would find that it spends an unacceptable amount of time computing which indices of the concentration tables should be targeted by the input stencil, and slicing the input tables at these locations.
In other words, this part of the code is the bottleneck:
// Determine the stencil's input region
let out_pos = [out_row, out_col];
let stencil_start = array2(|i| out_pos[i].saturating_sub(STENCIL_OFFSET[i]));
let stencil_end = array2(|i| (out_pos[i] + STENCIL_OFFSET[i] + 1).min(shape[i]));
let stencil_range = array2(|i| stencil_start[i]..stencil_end[i]);
let stencil_slice = ndarray::s![stencil_range[0].clone(), stencil_range[1].clone()];
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (start.u.slice(stencil_slice).indexed_iter())
.zip(start.v.slice(stencil_slice))
.fold(/* ... proceed with the computation ... */)
We have seen, however, that ndarray
provides us with an optimized sliding
window iterator called windows()
. One obvious next step would be to use this
iterator instead of doing all the indexing ourselves. This is not as easy as it
seems, but getting there will be the purpose of this chapter.
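As a quick reminder of what windows() does (a standalone illustration, not course code), here is how it iterates over every overlapping 3x3 view of a small array:
use ndarray::Array2;
// A 4x4 array has exactly 2x2 = 4 overlapping 3x3 windows
let a = Array2::from_shape_fn([4, 4], |(row, col)| (4 * row + col) as f32);
for window in a.windows([3, 3]) {
    println!("{window:?}");
}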
The boundary condition problem
Our Gray-Scott reaction simulation is a member of a larger family of numerical computations called stencil computations. What these computations all have in common is that their output at one particular spatial location depends on a weighted average of the neighbours of this spatial location in the input table. And therefore, all stencil computations must address one common concern: what should be done when there is no neighbour, on the edges or corners of the simulation domain?
In this course, we use a zero boundary condition. That is to say, we extend the simulation domain by saying that if we need to read a chemical species’ concentration outside of the simulation domain, the read will always return zero. And the way we implement this policy in code is that we do not do the computation at all for these stencil elements. This works because multiplying one of the stencil weights by zero will return zero, and therefore the associated contribution to the final weighted sum will be zero, as if the associated stencil elements were not taken into account to begin with.
Handling missing values like this is a common choice, and there is nothing wrong
with it per se. However, it means that we cannot simply switch to ndarray
’s
windows iterator by changing our simulation update loop into something like this:
ndarray::azip!(
(
out_u in &mut end.u,
out_v in &mut end.v,
win_u in start.u.windows([3, 3]),
win_v in start.v.windows([3, 3]),
) {
// TODO: Adjust the rest of the computation to work with these inputs
}
);
The reason why it does not work is that for a 2D array of dimensions NxM, iterating over output elements will produce all NxM elements, whereas iterating over 3x3 windows will only produce (N-2)x(M-2) valid input windows. Therefore, the above computation loop is meaningless and if you try to work with it anyway, you will inevitably produce incorrect results.
There are two classic ways to resolve this issue:
- We can make our data layout more complicated by resizing the concentration tables to add a strip of zeroes all around the actually useful data, and being careful never to touch these zeroes so that they stay zero.
- We can make our update loop more complicated by splitting it into two parts, one which processes the center of the simulation domain with optimal efficiency, and one which processes the edge at a reduced efficiency.
Both approaches have their merits, and at nontrivial problem sizes they have equivalent performance. The first approach makes the update loop simpler, but the second approach avoids polluting the rest of the codebase1 with edge element handling concerns. Knowing this, it is up to you to choose where you should spend your code complexity budget.
Optimizing the central iterations
The azip
loop above was actually almost right for computing the central
concentration values. It only takes a little bit of extra slicing to make it
correct:
let shape = start.shape();
let center = ndarray::s![1..shape[0]-1, 1..shape[1]-1];
ndarray::azip!(
(
out_u in end.u.slice_mut(center),
out_v in end.v.slice_mut(center),
win_u in start.u.windows([3, 3]),
win_v in start.v.windows([3, 3]),
) {
// TODO: Adjust the rest of the computation to work with these inputs
}
);
With this change, we now know that the computation will always work on 3x3 input windows, and therefore we can dramatically simplify the per-iteration code:
let shape = start.shape();
let center = s![1..shape[0]-1, 1..shape[1]-1];
ndarray::azip!(
(
out_u in end.u.slice_mut(center),
out_v in end.v.slice_mut(center),
win_u in start.u.windows([3, 3]),
win_v in start.v.windows([3, 3]),
) {
// Get the center concentration
let u = win_u[STENCIL_OFFSET];
let v = win_v[STENCIL_OFFSET];
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.into_iter())
.zip(win_v)
.zip(STENCIL_WEIGHTS.into_iter().flatten())
.fold(
[0.; 2],
|[acc_u, acc_v], ((&stencil_u, &stencil_v), weight)| {
[acc_u + weight * (stencil_u - u), acc_v + weight * (stencil_v - v)]
},
);
// Rest of the computation is unchanged
let uv_square = u * v * v;
let du = DIFFUSION_RATE_U * full_u - uv_square + opts.feedrate * (1.0 - u);
let dv = DIFFUSION_RATE_V * full_v + uv_square
- (opts.feedrate + opts.killrate) * v;
*out_u = u + du * opts.deltat;
*out_v = v + dv * opts.deltat;
}
);
You will probably not be surprised to learn that in addition to being much easier to read and maintain, this Rust code will also compile down to much faster machine code.
But of course, it does not fully resolve the problem at hand, as we are not computing the edge values of the chemical species concentration correctly. We are going to need either a separate code path or a data layout change to get there.
Exercise
For this exercise, we give you two possible strategies:
- Write separate code to handle the boundary values using the logic of the initial naive code. If you choose this path, keep the initial stencil update loop around in addition to the regularized loop above; you will need it to handle the edge values.
- Change the code that allocates the simulation’s data storage and writes
output down to HDF5 in order to allocate one extra element on each side of
the concentration arrays. Keep these elements equal to zero throughout the
computation, and use a
center
slice analogous to the one above in order to only emit results in the relevant region of the concentration array.
If you are undecided, I would advise going for the first option, as the resulting regular/irregular code split will give you an early taste of things to come in the next chapter.
Including, in larger numerical codebases, code that you may have little control over.
Basic SIMD
By now, we have largely resolved the slicing-induced performance bottleneck that plagued our initial naive simulation code. But as we have seen in the introductory chapters, there is still much to do before our Rust code puts hardware to good use.
In particular, we are not using SIMD yet, for reasons that will become clear once you realize that we are effectively computing one tiny dot product per output concentration value. As you know from earlier chapters of this course, compilers cannot translate naive dot product computations into efficient SIMD code; we must perform the translation for them.
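As a reminder of why this is the case (a minimal illustration, not course code): in a naive dot product, every addition feeds into the next one, and since IEEE floating-point addition is not associative, the compiler is not allowed to reorder that serial dependency chain into independent SIMD lanes on its own.
fn naive_dot(xs: &[Float], ys: &[Float]) -> Float {
    // The accumulator forms a serial dependency chain that blocks
    // autovectorization under standard floating-point semantics.
    xs.iter().zip(ys).fold(0.0, |acc, (&x, &y)| acc + x * y)
}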
This is the first thing that we will need to fix in order to use our hardware more efficiently.
Picking a granularity
This computation is complex enough that there are actually two different ways to vectorize it:
- We could try to internally vectorize the computation of a single (u, v) concentration value pair. The diffusion gradient computation, which is basically a dot product, would be an obvious candidate for vectorization as we already know how to compute SIMD dot products.
- On each iteration of the loop, we could try to compute not just one pair of concentration values, but one full SIMD vector of concentration values for each chemical species. Basically, anywhere the current code manipulates a single floating-point number, the new code would manipulate a full SIMD vector of floating-point numbers instead.
Are there any reasons to prefer one of these two options over the other? Indeed there are:
- Option 1 only lets us vectorize a subset of the computation, while option 2 lets us vectorize the full computation. Because Amdahl’s law is a thing, this is an argument in favor of option 2.
- Option 1 involves performing SIMD horizontal reductions (going from a SIMD vector of inputs to a single floating-point output) in a tight loop. In contrast, option 2 involves no horizontal reduction. Because horizontal reductions are slow, this is an argument in favor of option 2.
- Option 2 only makes sense if the entire computation is done using SIMD. This
will consume more SIMD execution resources (CPU SIMD registers, etc) and make
our life difficult in the presence of constructs that do not map well to SIMD
like
if/else
and loop early exits. Therefore, on certain hardware architectures or for sufficiently complex code, only option 1 (fine-grained SIMD) may be available.
In this school, we are using x86 CPUs, which perform scalar floating-point operations using the same CPU resources as SIMD operations. Therefore, switching to SIMD will not change the CPU resource usage profile, and there is no risk of blowing a CPU hardware budget that we used to fit in before.
And on the code side, our computation is simple enough that translating it to SIMD operations is not super-difficult. So overall, for this computation, option 2 (coarse-grained SIMD) is the clear winner.
Visualizing the goal
As it turns out, the simplest way we can go about introducing SIMD in this computation is to make our input and output windows wider, as we will now visually demonstrate.
Currently, we are jointly iterating over input windows and output values as in the following sketch…
…where the gray rectangle represents the overall dataset, the blue square represents the location of the input values that we are reading at each step of the update loop, and the red square represents the location of the output value that we are writing at each step of the update loop.
Our goal in this chapter will be to instead iterate over wider (but equally high) input and output slices, where the output region is as wide as a hardware SIMD vector, and the input region follows by adding one data point on each side.
Once we get there, introducing SIMD will “just” be a matter of replacing each scalar operation in our algorithm with a SIMD operation that targets a SIMD vector starting at the same memory location.
Breaking the symmetry
This move to wider input and output windows is not without consequences, however. It breaks the symmetry between rows and columns that has existed so far in our computation, which is what allowed us to perform 2D iteration over the dataset with a single, 1D-feeling loop.
Now we are going to need an outer loop over lines of output…
…and, under that outer loop, an inner loop over SIMD vectors within each line.
We are also going to need to decide how to handle the case where the number of output elements within a line is not a multiple of the SIMD vector length:
- Do we simply forbid this to keep our code simple, at the cost of making users angry?
- Do we handle the remaining elements of each line using a scalar computation?
- Do we start by slicing up a “regular” part of the computation that has all the right properties for SIMD, processing the rest (including the edges of the simulation domain) using a more general but slower scalar implementation?
There is no single right answer here, and the right way to go about this will depend on the technical and political specifics of the software that you are writing. But option 3 is a good tradeoff if you are unsure, and may therefore be a good starting point.
The “easy” part
We will now start implementing a subset of the update loop that only works on the part of the simulation domain that is easy to handle using SIMD.
First of all, assuming a pre-existing SIMD_WIDTH
constant which contains our
SIMD vector width, we can select an output region that covers all previous
output rows, but only a number of columns that corresponds to an integer number
of SIMD vectors:
use ndarray::s;
let [num_rows, num_cols] = start.shape();
let num_regular_cols = ((num_cols - 2) / SIMD_WIDTH) * SIMD_WIDTH;
let regular_output = s![1..(num_rows - 1), 1..=num_regular_cols];
let mut regular_out_u = end.u.slice_mut(regular_output);
let mut regular_out_v = end.v.slice_mut(regular_output);
let regular_input = s![0..num_rows, 0..=(num_regular_cols + 1)];
let regular_in_u = start.u.slice(regular_input);
let regular_in_v = start.v.slice(regular_input);
We can then iterate over rows of the output arrays and the corresponding windows of three consecutive input rows in the input arrays…
use ndarray::Axis;
let out_rows = (regular_out_u.rows_mut().into_iter()).zip(regular_out_v.rows_mut());
let in_windows = (regular_in_u.axis_windows(Axis(0), 3).into_iter())
.zip(regular_in_v.axis_windows(Axis(0), 3));
// Cannot use `ndarray::azip` here because it does not support jointly
// iterating over ArrayViews of different dimensionality (here 1D and 2D)
for ((mut out_row_u, mut out_row_v), (win_u, win_v)) in out_rows.zip(in_windows) {
// TODO: Process a row of output here
}
…and within that outer loop, we can have an inner loop on SIMD-sized chunks within each row of the output matrix, along with the corresponding input windows.
let out_chunks = (out_row_u.exact_chunks_mut(SIMD_WIDTH).into_iter())
.zip(out_row_v.exact_chunks_mut(SIMD_WIDTH));
let in_windows = (win_u
.axis_windows(Axis(1), SIMD_WIDTH + 2)
.into_iter()
.step_by(SIMD_WIDTH))
.zip(
win_v
.axis_windows(Axis(1), SIMD_WIDTH + 2)
.into_iter()
.step_by(SIMD_WIDTH),
);
// Cannot use `ndarray::azip` here for the same reason as before
for ((mut out_chunk_u, mut out_chunk_v), (win_u, win_v)) in out_chunks.zip(in_windows) {
// TODO: Process a SIMD-sized chunk of output here
}
Finally, within the body of that inner loop, we can introduce a SIMD version of
our regularized update algorithm, using std::simd
:
use crate::data::Float;
use std::simd::prelude::*;
// Access the SIMD data corresponding to the center concentration
let simd_input = s![
STENCIL_OFFSET[0],
STENCIL_OFFSET[1]..SIMD_WIDTH + STENCIL_OFFSET[1],
];
let u = win_u.slice(simd_input);
let v = win_v.slice(simd_input);
// Load it as a SIMD vector
//
// The conversion from ArrayView to slice can fail if the data is
// not contiguous in memory. In this case, we know it should always
// be contiguous, so we can use unwrap() which panics otherwise.
type Vector = Simd<Float, SIMD_WIDTH>;
let u = Vector::from_slice(u.as_slice().unwrap());
let v = Vector::from_slice(v.as_slice().unwrap());
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.windows([1, SIMD_WIDTH]).into_iter())
.zip(win_v.windows([1, SIMD_WIDTH]))
.zip(STENCIL_WEIGHTS.into_iter().flatten())
.fold(
[Vector::splat(0.); 2],
|[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
let stencil_u = Vector::from_slice(stencil_u.as_slice().unwrap());
let stencil_v = Vector::from_slice(stencil_v.as_slice().unwrap());
let weight = Vector::splat(weight);
[
acc_u + weight * (stencil_u - u),
acc_v + weight * (stencil_v - v),
]
},
);
// Compute SIMD versions of all the float constants that we use
let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
let feedrate = Vector::splat(opts.feedrate);
let killrate = Vector::splat(opts.killrate);
let deltat = Vector::splat(opts.deltat);
let ones = Vector::splat(1.0);
// Compute the output values of u and v
let uv_square = u * v * v;
let du = diffusion_rate_u * full_u - uv_square + feedrate * (ones - u);
let dv = diffusion_rate_v * full_v + uv_square - (feedrate + killrate) * v;
let out_u = u + du * deltat;
let out_v = v + dv * deltat;
// Store the output values of u and v
out_u.copy_to_slice(out_chunk_u.as_slice_mut().unwrap());
out_v.copy_to_slice(out_chunk_v.as_slice_mut().unwrap());
The main highlights here are that
- We can convert back from an
ArrayView
that has contiguous storage to a standard Rust slice for the purpose of interacting with the SIMD API, which doesn’t know about ndarray.
- We can still use 1D-like iteration in our inner diffusion gradient loop; it is only the outer loops on output elements that are affected by the change of algorithm.
- Floating point constants can be turned into SIMD vectors whose elements are
all equal to the constant by using
Simd::splat
, which on x86 maps to the hardware broadcastss instruction.
- Adding SIMD to any nontrivial codebase without turning it into an unreadable mess is a major software engineering challenge1.
Finally, because we are using the experimental nightly-only std::simd
API, we
will need to enable it by adding the associated #![feature(portable_simd)]
directive to the top of exercises/src/lib.rs
(assuming this is where you have
put the update function’s implementation).
Exercise
Make a backup of your current update function; you will need some of it to handle the irregular subset of the data (simulation domain edges, extra columns that do not fit a SIMD vector nicely, etc.).
Then integrate the above regularized SIMD algorithm into your code, and complete
it by adding function multiversioning through the multiversion
crate (as
presented in the SIMD chapter), so that you get something to put
inside of this SIMD_WIDTH
constant:
const SIMD_WIDTH: usize = ...;
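For reference, here is one possible way to obtain such a constant. It assumes the selected_target! macro and the suggested_simd_width() method exposed by recent versions of the multiversion crate; the SIMD chapter mentioned above shows the approach that this course actually uses, so treat this as a sketch rather than the definitive recipe:
use multiversion::{multiversion, target::selected_target};
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
pub fn update(opts: &UpdateOptions, start: &UV, end: &mut UV) {
    // Each compiled clone of this function queries the SIMD width suggested
    // for the instruction set that it targets (None means no SIMD support,
    // in which case we fall back to a width of 1).
    const SIMD_WIDTH: usize = match selected_target!().suggested_simd_width::<Float>() {
        Some(width) => width,
        None => 1,
    };
    // ... regular SIMD code path and irregular scalar fallback go here ...
}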
Finally, make your update function handle the irregular part of the simulation domain by reusing your former implementation.
Though we will see in the next chapter that SIMD can be made both easier and more efficient, if we are willing to sacrifice any hope of interoperability with other code and rearrange data into a highly non-obvious layout.
Advanced SIMD
The SIMD version of the Gray-Scott simulation that we have implemented in the previous chapter has two significant issues that would be worth improving upon:
- The SIMD update loop is significantly more complex than the regularized scalar version.
- Its memory accesses suffer from alignment issues that will reduce runtime performance. On x86 specifically, at least 1 in 4 memory accesses1 will be slowed down in SSE builds, 1 in 2 in AVX builds, and every memory access in AVX-512 builds. So the more advanced your CPU’s SIMD instruction set, the worse the relative penalty becomes.
Interestingly, there is a way to improve the code on both of these dimensions, and simplify the update loop immensely while improving its runtime performance. However, there is no free lunch in programming, and we will need to pay a significant price in exchange for these improvements:
- The layout of numerical data in RAM will become less obvious and harder to reason about.
- Interoperability with other code that expects a more standard memory layout, such as the HDF5 I/O library, will become more difficult and have higher runtime overhead.
- We will need to give up on the idea that we can allow users to pick a simulation domain of any size they like, and enforce some hardware-originated constraints on the problem size.
Assuming these are constraints that you can live with, let us now get started!
A new data layout
So far, we have been laying out our 2D concentration data in the following straightforward manner:

Now let us assume that we want to use SIMD vectors of width 3. We will start by slicing our scalar data into three equally tall horizontal slices, one per SIMD vector lane…

…and then we will shuffle around the data such that if we read it from left to right, we now get the first column of the former top block, followed by the first column of the middle block, followed by the first column of the bottom block, followed by the second column of the top block, and so on:
At this point, you may reasonably be skeptical that this is an improvement. But before you start doubting my sanity, let us look at the same data layout again, with a different visualization that emphasizes aligned SIMD vectors:
And now, let us assume that we want to compute the output aligned SIMD vector at coordinate (1, 1) from the top-left corner in this SIMD super-structure:
As it turns out, we can do it using a read and write pattern that looks exactly like the regularized scalar stencil that we have used before, but this time using correctly aligned SIMD vectors instead of scalar data points. Hopefully, by now, you will agree that we are getting somewhere.
Now that we have a very efficient data access pattern at the center of the simulation domain, let us reduce the need for special edge handling in order to make our simulation update loop even simpler. The left and right edges are easy, as we can just add zeroed out vectors on both sides and be careful not to overwrite them later in the computation…
…but the top and bottom edges need more care. The upper neighbours of scalar elements on the first row of the scalar domain and the lower neighbours of scalar elements on its last row are easy, because they can also be permanently set to zero:
However, other top and bottom scalar elements will somehow need to wrap around the 2D array and shift by one column in order to access their upper and lower neighbors:
There are several ways to implement this wrapping around and column-shifting in a SIMD computation. In this course, we will use the approach of updating the upper and lower rows of the array after each computation step in order to keep them in sync with the matching rows at the opposite end of the array. This wrapping around and shifting will be done using an advanced family of SIMD instructions known as swizzles.
Even though SIMD swizzles are relatively expensive CPU instructions, especially on x86, their overhead will be imperceptible for sufficiently tall simulation domains. That’s because we will only need to use them in order to update the top and bottom rows of the simulation, no matter how many rows the simulation domain has, whereas the rest of the simulation update loop has a computational cost that grows linearly with the number of simulation domain rows.
Adjusting the code
Data layout changes can be an extremely powerful tool to optimize your code. And generally speaking there will always be a limit to how much performance you can gain without changing how data is laid out in files and in RAM.
But the motto of this course is that there is no free lunch in programming, and in this particular case, the price to pay for this change will be a major code rewrite to adapt almost every piece of code that we have previously written to the new data layout.2
U and V concentration storage
Our core UV
struct has so far stored scalar data, but now we will make it
store SIMD vectors instead. We want to do it for two reasons:
- It will make our SIMD update loop code much simpler and clearer.
- It will tell the compiler that we want to do SIMD with a certain width and the associated memory allocation should be correctly aligned for this purpose.
The data structure definition therefore changes to this:
use std::simd::{prelude::*, LaneCount, SupportedLaneCount};
/// Type alias for a SIMD vector of floats, we're going to use this type a lot
pub type Vector<const SIMD_WIDTH: usize> = Simd<Float, SIMD_WIDTH>;
/// Storage for the concentrations of the U and V chemical species
pub struct UV<const SIMD_WIDTH: usize>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
pub u: Array2<Vector<SIMD_WIDTH>>,
pub v: Array2<Vector<SIMD_WIDTH>>,
}
Notice that the UV struct definition must now be generic over the SIMD vector width (reflecting the associated change in memory alignment in the underlying Array2s), and that this genericity must be bounded by a where clause3 to indicate that not all usizes are valid SIMD vector widths.4
The methods of the UV type change as well: the associated impl block must now be made generic too, and the implementation of the various methods must adapt to the fact that our inner storage is now made of SIMD vectors.
First, the impl block will gain generic parameters, with bounds that match those of the type definition:
impl<const SIMD_WIDTH: usize> UV<SIMD_WIDTH>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
// TODO: Add methods here
}
The reason why the impl block needs generic bounds as well is that this gives the language syntax headroom to let you add methods that are specific to one particular value of the generic parameters, like this:
impl UV<4> {
// TODO: Add code specific to vectors of width 4 here
}
That being said, repeating bounds like this is certainly annoying in the common case, and there is a longstanding desire to add a way to tell the language “please just repeat the bounds of the type definitions”, using syntax like this:
impl UV<_> {
// TODO: Add generic code that works for all SIMD_WIDTHS here
}
The interested reader is advised to use “implied bounds” as a search engine keyword, in order to learn more about how this could work, and why integrating this feature into Rust has not been as easy as initially envisioned.
The main UV::new()
constructor changes a fair bit because it needs to account
for the fact that…
- There is now one extra SIMD vector on each side of the simulation domain
- The SIMD-optimized data layout is quite different, and mapping our original rectangular concentration pattern to it is non-trivial.
- To simplify this version of the code, we set the constraint that both the height and the width of the simulation domain must be a multiple of the SIMD vector width.
…which, when taken together, leads to this implementation:
fn new(num_scalar_rows: usize, num_scalar_cols: usize) -> Self {
// Enforce constraints of the new data layout
assert!(
(num_scalar_rows % SIMD_WIDTH == 0) && (num_scalar_cols % SIMD_WIDTH == 0),
"num_scalar_rows and num_scalar_cols must be a multiple of the SIMD vector width"
);
let num_center_rows = num_scalar_rows / SIMD_WIDTH;
let simd_shape = [num_center_rows + 2, num_scalar_cols + 2];
// Predicate which selects central rows and columns
let is_center = |simd_row, simd_col| {
simd_row >= 1
&& simd_row <= num_center_rows
&& simd_col >= 1
&& simd_col <= num_scalar_cols
};
// SIMDfied version of the scalar pattern, based on mapping SIMD vector
// position and SIMD lane indices to equivalent positions in the
// original scalar array.
//
// Initially, we zero out all edge elements. We will fix the top/bottom
// elements in a later step.
let pattern = |simd_row, simd_col| {
let elements: [Float; SIMD_WIDTH] = if is_center(simd_row, simd_col) {
std::array::from_fn(|simd_lane| {
let scalar_row = simd_row - 1 + simd_lane * num_center_rows;
let scalar_col = simd_col - 1;
(scalar_row >= (7 * num_scalar_rows / 16).saturating_sub(4)
&& scalar_row < (8 * num_scalar_rows / 16).saturating_sub(4)
&& scalar_col >= 7 * num_scalar_cols / 16
&& scalar_col < 8 * num_scalar_cols / 16) as u8 as Float
})
} else {
[0.0; SIMD_WIDTH]
};
Vector::from(elements)
};
// The next steps are very similar to the scalar version...
let u = Array2::from_shape_fn(simd_shape, |(simd_row, simd_col)| {
if is_center(simd_row, simd_col) {
Vector::splat(1.0) - pattern(simd_row, simd_col)
} else {
Vector::splat(0.0)
}
});
let v = Array2::from_shape_fn(simd_shape, |(simd_row, simd_col)| {
pattern(simd_row, simd_col)
});
let mut result = Self { u, v };
// ...except we must fix up the top and bottom rows of the simulation
// domain in order to achieve the intended data layout.
result.update_top_bottom();
result
}
Notice the call to the new update_top_bottom()
method, which is in charge of
calculating the top and bottom rows of the concentration array. We will get back
to how this method works in a bit.
The all-zeroes constructor changes very little in comparison, since when everything is zero, the order of scalar elements within the array does not matter:
/// Set up an all-zeroes chemical species concentration
fn zeroes(num_scalar_rows: usize, num_scalar_cols: usize) -> Self {
// Same idea as above
assert!(
(num_scalar_rows % SIMD_WIDTH == 0) && (num_scalar_cols % SIMD_WIDTH == 0),
"num_scalar_rows and num_scalar_cols must be a multiple of the SIMD vector width"
);
let num_simd_rows = num_scalar_rows / SIMD_WIDTH;
let simd_shape = [num_simd_rows + 2, num_scalar_cols + 2];
let u = Array2::default(simd_shape);
let v = Array2::default(simd_shape);
Self { u, v }
}
The notion of shape
becomes ambiguous in the new layout, because we need to
clarify whether we are talking about the logical size of the simulation domain
in scalar concentration data points, or its physical size in SIMD vector
elements. Therefore, the former shape()
method is split into two methods, and
callers must be adapted to call the right method for their needs:
/// Get the number of rows and columns of the SIMD simulation domain
pub fn simd_shape(&self) -> [usize; 2] {
let shape = self.u.shape();
[shape[0], shape[1]]
}
/// Get the number of rows and columns of the scalar simulation domain
pub fn scalar_shape(&self) -> [usize; 2] {
let [simd_rows, simd_cols] = self.simd_shape();
[(simd_rows - 2) * SIMD_WIDTH, simd_cols - 2]
}
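Before moving on, here is a quick worked example of these two notions of shape (assuming SIMD_WIDTH = 4, a 1024x1024 scalar simulation domain, and code running inside the module that defines UV, since its constructors are private):
let uv = UV::<4>::new(1024, 1024);
// 1024 / 4 + 2 = 258 SIMD rows of 1024 + 2 = 1026 SIMD vectors each...
assert_eq!(uv.simd_shape(), [258, 1026]);
// ...which maps back to the original 1024x1024 scalar domain
assert_eq!(uv.scalar_shape(), [1024, 1024]);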
…and finally, we get to discuss the process through which the top and bottom rows of the SIMD concentration array are updated:
use multiversion::multiversion;
use ndarray::s;
// ...
/// Update the top and bottom rows of all inner arrays of concentrations
///
/// This method must be called between the end of a simulation update step
/// and the beginning of the next step to sync up the top/bottom rows of the
/// SIMD data store. It can also be used to simplify initialization.
fn update_top_bottom(&mut self) {
// Due to a combination of language and compiler limitations, we
// currently need both function multiversioning and genericity over SIMD
// width here. See handouts for the full explanation.
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
fn update_array<const WIDTH: usize>(arr: &mut Array2<Vector<WIDTH>>)
where
LaneCount<WIDTH>: SupportedLaneCount,
{
// TODO: Implementation for one concentration array goes here
}
update_array(&mut self.u);
update_array(&mut self.v);
}
First of all, we get this little eyesore of an inner function declaration:
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
fn update_array<const WIDTH: usize>(arr: &mut Array2<Vector<WIDTH>>)
where
LaneCount<WIDTH>: SupportedLaneCount,
{
// TODO: Top/bottom row update goes here
}
It needs to have a multiversion
attribute AND a generic signature due to a
combination of language and compiler limitations:
- Rust currently does not allow inner function declarations to access the
generic parameters of outer functions. Therefore, the inner function must be
made generic over
WIDTH
even though it will only be called for one specific SIMD_WIDTH.
- Genericity over
SIMD_WIDTH
is not enough to achieve optimal SIMD code generation, because by default, the compiler will only generate code for the lowest-common-denominator SIMD instruction set (SSE2 on x86_64), emulating wider vector widths using narrower SIMD operations. We still need function multi-versioning in order to generate one optimized code path per supported SIMD instruction set, and dispatch to the right code path at runtime.
Beyond that, the implementation is quite straightforward. First we extract the two top and the two bottom rows of the concentration array using ndarray slicing, ignoring the leftmost and rightmost elements that we know to be zero…
// Select the top and bottom rows of the simulation domain
let shape = arr.shape();
let [num_simd_rows, num_cols] = [shape[0], shape[1]];
let horizontal_range = 1..(num_cols - 1);
let row_top = s![0, horizontal_range.clone()];
let row_after_top = s![1, horizontal_range.clone()];
let row_before_bottom = s![num_simd_rows - 2, horizontal_range.clone()];
let row_bottom = s![num_simd_rows - 1, horizontal_range];
// Extract the corresponding data slices
let (row_top, row_after_top, row_before_bottom, row_bottom) =
arr.multi_slice_mut((row_top, row_after_top, row_before_bottom, row_bottom));
…and then we iterate over all SIMD vectors within the rows, generating “lane-shifted” versions using a combination of SIMD rotates and zero-assignment (which have wider hardware support than lane-shifting), and storing them on the opposite end of the array:
// Jointly iterate over all rows
ndarray::azip!((
vec_top in row_top,
&mut vec_after_top in row_after_top,
&mut vec_before_bottom in row_before_bottom,
vec_bottom in row_bottom
) {
// Top vector acquires the value of the before-bottom vector,
// rotated right by one lane and with the first element set to zero
let mut shifted_before_bottom = vec_before_bottom.rotate_elements_right::<1>();
shifted_before_bottom[0] = 0.0;
*vec_top = shifted_before_bottom;
// Bottom vector acquires the value of the after-top vector, rotated
// left by one lane and with the last element set to zero
let mut shifted_after_top = vec_after_top.rotate_elements_left::<1>();
shifted_after_top[WIDTH - 1] = 0.0;
*vec_bottom = shifted_after_top;
});
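To make the lane-shifting trick more concrete, here is what it does to a single vector of width 4, outside of any ndarray machinery (a standalone illustration, not course code):
use std::simd::Simd;
let v = Simd::from_array([1.0_f32, 2.0, 3.0, 4.0]);
// Rotate right by one lane: [4.0, 1.0, 2.0, 3.0]
let mut shifted = v.rotate_elements_right::<1>();
// Zero out the wrapped-around lane: [0.0, 1.0, 2.0, 3.0]
shifted[0] = 0.0;
assert_eq!(shifted.to_array(), [0.0, 1.0, 2.0, 3.0]);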
Double buffering and the SIMD-scalar boundary
Because our new concentration storage has become generic over the width of the SIMD instruction set in use, our double buffer abstraction must become generic as well:
pub struct Concentrations<const SIMD_WIDTH: usize>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
buffers: [UV<SIMD_WIDTH>; 2],
src_is_1: bool,
}
However, now is a good time to ask ourselves where we should put the boundary of code which must know about this genericity and the associated SIMD data storage. We do not want to simply propagate SIMD types everywhere because…
- It would make all the code generic, which as we have seen is a little annoying, and also slows compilation down (all SIMD-generic code is compiled once per possible hardware vector width).
- Some of our dependencies like
hdf5
are lucky enough not to know or care about SIMD data types. At some point, before we hand over data to these dependencies, a conversion to the standard scalar layout will need to be performed.
Right now, the only point where the Concentrations::current()
method is used
is when we hand over the current value of V species concentration array to HDF5
for the purpose of writing it out. Therefore, it is reasonable to use this
method as our SIMD-scalar boundary by turning it into a current_v()
method
that returns a scalar view of the V concentration array, which is computed on
the fly whenever requested.
We can prepare for this by adding a new scalar array member to the
Concentrations
struct…
pub struct Concentrations<const SIMD_WIDTH: usize>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
buffers: [UV<SIMD_WIDTH>; 2],
src_is_1: bool,
scalar_v: Array2<Float>, // <- New
}
…and zero-initializing it in the Concentrations::new()
constructor:
/// Set up the simulation state
pub fn new(num_scalar_rows: usize, num_scalar_cols: usize) -> Self {
Self {
buffers: [
UV::new(num_scalar_rows, num_scalar_cols),
UV::zeroes(num_scalar_rows, num_scalar_cols),
],
src_is_1: false,
scalar_v: Array2::zeros([num_scalar_rows, num_scalar_cols]), // <- New
}
}
And finally, we must turn current()
into current_v()
, make it take &mut self
instead of &self
so that it can mutate the internal buffer, and make it
return a reference to the internal buffer:
/// Read out the current V species concentration
pub fn current_v(&mut self) -> &Array2<Float> {
let simd_v = &self.buffers[self.src_is_1 as usize].v;
// TODO: Compute scalar_v from simd_v
&self.scalar_v
}
The rest of the double buffer does not change much; it is just a matter of…
- Adding generics in the right place
- Exposing new API distinctions that didn’t exist before between the scalar domain shape and the SIMD domain shape
- Updating the top and bottom rows of the SIMD dataset after each update.
impl<const SIMD_WIDTH: usize> Concentrations<SIMD_WIDTH>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
/// Set up the simulation state
pub fn new(num_scalar_rows: usize, num_scalar_cols: usize) -> Self {
Self {
buffers: [
UV::new(num_scalar_rows, num_scalar_cols),
UV::zeroes(num_scalar_rows, num_scalar_cols),
],
src_is_1: false,
scalar_v: Array2::zeros([num_scalar_rows, num_scalar_cols]),
}
}
/// Get the number of rows and columns of the SIMD simulation domain
pub fn simd_shape(&self) -> [usize; 2] {
self.buffers[0].simd_shape()
}
/// Get the number of rows and columns of the scalar simulation domain
pub fn scalar_shape(&self) -> [usize; 2] {
self.buffers[0].scalar_shape()
}
/// Read out the current V species concentration
pub fn current_v(&mut self) -> &Array2<Float> {
let simd_v = &self.buffers[self.src_is_1 as usize].v;
// TODO: Compute scalar_v from simd_v
&self.scalar_v
}
/// Run a simulation step
///
/// The user callback function `step` will be called with two input UVs:
/// one containing the initial species concentration at the start of the
/// simulation step, and one to receive the final species concentration that
/// the simulation step is in charge of generating.
///
/// The `step` callback need not update the top and bottom rows of the
/// SIMD arrays; they will be updated automatically.
pub fn update(&mut self, step: impl FnOnce(&UV<SIMD_WIDTH>, &mut UV<SIMD_WIDTH>)) {
let [ref mut uv_0, ref mut uv_1] = &mut self.buffers;
if self.src_is_1 {
step(uv_1, uv_0);
uv_0.update_top_bottom();
} else {
step(uv_0, uv_1);
uv_1.update_top_bottom();
}
self.src_is_1 = !self.src_is_1;
}
}
But of course, we do need to take care of that TODO
within the implementation
of current_v()
.
Going back to the scalar data layout
Let us go back to the virtual drawing board and fetch back our former schematic that illustrates the mapping between the SIMD and scalar data layout:
It may not be clear from a look at it, but given a set of SIMD_WIDTH consecutive vectors of SIMD data, it is possible to efficiently reconstruct SIMD_WIDTH vectors of scalar data.
In other words, there are reasonably efficient SIMD instructions for performing this transformation:
Unfortunately, the logic of these instructions is highly hardware-dependent,
and Rust’s std::simd
has so far shied away from implementing higher-level
abstract operations that involve more than two input SIMD vectors. Therefore, we
will have to rely on autovectorization for this as of today.
Also, to avoid complicating the code with a scalar fallback, we should enforce that the number of columns in the underlying scalar array is a multiple of SIMD_WIDTH, which is reasonable given that we already enforce this for the number of rows. This is already done in the UV constructors that you have been provided with above.
But given these two prerequisites, here is a current_v()
implementation that
does the job:
use ndarray::ArrayView2;
// ...
/// Read out the current V species concentration
pub fn current_v(&mut self) -> &Array2<Float> {
// Extract the center region of the V input concentration table
let uv = &self.buffers[self.src_is_1 as usize];
let [simd_rows, simd_cols] = uv.simd_shape();
let simd_v_center = uv.v.slice(s![1..simd_rows - 1, 1..simd_cols - 1]);
// multiversion does not support methods that take `self` yet, so we must
// use an inner function for now.
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
fn make_scalar<const WIDTH: usize>(
simd_center: ArrayView2<Vector<WIDTH>>,
scalar_output: &mut Array2<Float>,
) where
LaneCount<WIDTH>: SupportedLaneCount,
{
// Iterate over SIMD rows...
let simd_center_rows = simd_center.nrows();
for (simd_row_idx, row) in simd_center.rows().into_iter().enumerate() {
// ...and over chunks of WIDTH vectors within each row
for (simd_chunk_idx, chunk) in row.exact_chunks(WIDTH).into_iter().enumerate() {
// Convert this chunk of SIMD vectors to the scalar layout,
// relying on autovectorization for performance for now...
let transposed: [[Float; WIDTH]; WIDTH] = std::array::from_fn(|outer_idx| {
std::array::from_fn(|inner_idx| chunk[inner_idx][outer_idx])
});
// ...then store these scalar vectors in the right location
for (vec_idx, data) in transposed.into_iter().enumerate() {
let scalar_row = simd_row_idx + vec_idx * simd_center_rows;
let scalar_col = simd_chunk_idx * WIDTH;
scalar_output
.slice_mut(s![scalar_row, scalar_col..scalar_col + WIDTH])
.as_slice_mut()
.unwrap()
.copy_from_slice(&data)
}
}
}
}
make_scalar(simd_v_center.view(), &mut self.scalar_v);
// Now scalar_v contains the scalar version of v
&self.scalar_v
}
As you may guess, this implementation could be optimized further…
- The autovectorized transpose could be replaced with hardware-specific SIMD swizzles.
- The repeated slicing of
scalar_v
could be replaced with a set of iterators that yield the right output chunks without any risk of unelided bounds checks.
…but given that even the most optimized data transpose is going to be costly due to how hardware works, it would probably be best to optimize by simply saving scalar output less often!
The new simulation kernel
Finally, after going through all of this trouble, we can adapt the heart of the
simulation, the update()
loop, to the new data layout:
use std::simd::{LaneCount, SupportedLaneCount};
/// Simulation update function
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
pub fn update<const SIMD_WIDTH: usize>(
opts: &UpdateOptions,
start: &UV<SIMD_WIDTH>,
end: &mut UV<SIMD_WIDTH>,
) where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
let [num_simd_rows, num_simd_cols] = start.simd_shape();
let output_range = s![1..=(num_simd_rows - 2), 1..=(num_simd_cols - 2)];
ndarray::azip!((
win_u in start.u.windows([3, 3]),
win_v in start.v.windows([3, 3]),
out_u in end.u.slice_mut(output_range),
out_v in end.v.slice_mut(output_range),
) {
// Access the SIMD data corresponding to the center concentration
let u = win_u[STENCIL_OFFSET];
let v = win_v[STENCIL_OFFSET];
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.into_iter())
.zip(win_v)
.zip(STENCIL_WEIGHTS.into_iter().flatten())
.fold(
[Vector::splat(0.); 2],
|[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
let weight = Vector::splat(weight);
[
acc_u + weight * (stencil_u - u),
acc_v + weight * (stencil_v - v),
]
},
);
// Compute SIMD versions of all the float constants that we use
let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
let feedrate = Vector::splat(opts.feedrate);
let killrate = Vector::splat(opts.killrate);
let deltat = Vector::splat(opts.deltat);
let ones = Vector::splat(1.0);
// Compute the output values of u and v
let uv_square = u * v * v;
let du = diffusion_rate_u * full_u - uv_square + feedrate * (ones - u);
let dv = diffusion_rate_v * full_v + uv_square - (feedrate + killrate) * v;
*out_u = u + du * deltat;
*out_v = v + dv * deltat;
});
}
Notice the following:
- This code looks a lot simpler and closer to the regularized scalar code than the previous SIMD code which tried to adjust to the scalar data layout. And that’s all there is to it. No need for a separate update code path to handle the edges of the simulation domain!
- The SIMD vector width is no longer an implementation detail that can be contained within the scope of this function, as it appears in the input and output types. Therefore, function multiversioning alone is not enough and we need genericity over the SIMD width too.
Adapting run_simulation()
and the HDF5Writer
The remaining changes in the shared computation code are minor. Since our
simulation runner allocates the concentration arrays, it must now become generic
over the SIMD vector type and adapt to the new Concentrations
API…
/// Simulation runner, with a user-specified concentration update function
pub fn run_simulation<const SIMD_WIDTH: usize>(
opts: &RunnerOptions,
// Notice that we must use FnMut here because the update function can be
// called multiple times, which FnOnce does not allow.
mut update: impl FnMut(&UV<SIMD_WIDTH>, &mut UV<SIMD_WIDTH>),
) -> hdf5::Result<()>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
// Set up the concentrations buffer
let mut concentrations = Concentrations::new(opts.num_rows, opts.num_cols);
// Set up HDF5 I/O
let mut hdf5 = HDF5Writer::create(
&opts.file_name,
concentrations.scalar_shape(),
opts.num_output_steps,
)?;
// Produce the requested amount of concentration arrays
for _ in 0..opts.num_output_steps {
// Run a number of simulation steps
for _ in 0..opts.compute_steps_per_output_step {
// Update the concentrations of the U and V chemical species
concentrations.update(&mut update);
}
// Write down the current simulation output
hdf5.write(concentrations.current_v())?;
}
// Close the HDF5 file
hdf5.close()
}
…while the write()
method of the HDF5Writer
must adapt to the fact that it
now only has access to the V species’ concentration, not the full dataset:
use ndarray::Array2;
// ...
/// Write a new V species concentration array to the file
pub fn write(&mut self, result_v: &Array2<Float>) -> hdf5::Result<()> {
self.dataset
.write_slice(result_v, (self.position, .., ..))?;
self.position += 1;
Ok(())
}
Exercise
Integrate all of these changes into your code repository. Then adjust both
the microbenchmark at benches/simulate.rs
and the main simulation binary at
src/bin/simulate.rs
to call the new run_simulation()
and update()
function.
You will need to use one final instance of function multiversioning in order to
determine the appropriate SIMD_WIDTH
inside of the top-level binaries. See
our initial SIMD chapter for an example of how this is done.
In the case of microbenchmarks, you will also need to tune the loop on problem sizes in order to stop running benchmarks on the 4x4 problem size, which is not supported by this implementation.
Add a microbenchmark that measures the overhead of converting data from the SIMD to the scalar data layout, complementing the simulation update microbenchmark that you already have.
And finally, measure the impact of the new data layout on the performance of simulation updates.
Unfortunately, you will likely find the results a little disappointing. The problem that causes this disappointment will be explained in the next chapter.
Any memory access that straddles a cache line boundary must load/store two cache lines instead of one, so the minimum penalty for these accesses is a 2x slowdown. Specific CPU architectures may come with higher misaligned memory access penalties; for example, some RISC CPUs do not support unaligned memory accesses at all, so every unaligned memory access must be decomposed into two memory accesses as described above.
The high software development costs of data layout changes are often used as an excuse not to do them. However, this is looking at the problem backwards. Data layout changes which are performed early in a program’s lifetime have a somewhat reasonable cost, so the actual issue here is starting to expose data-based interfaces and have other codes rely on them before your code is actually ready to commit to a stable data layout, which should not happen until the end of the performance optimization process. This means that contrary to common “premature optimization is the root of all evil” programmer folk wisdom, there is actually a right time to do performance optimizations in a software project, and that time should not be at the very end of the development process, but rather as soon as you have tests to assess that your optimizations are not breaking anything and microbenchmarks to measure the impact of your optimizations.
We do not have the time to explore Rust genericity in this short course, but in a nutshell generic Rust code must be defined in such a way that it is either valid for all possible values of the generic parameters, or spells out what constitutes a valid generic parameter value. This ensures that instantiation-time errors caused by use of invalid generics parameters remain short and easy to understand in Rust, which overall makes Rust generics a lot more pleasant to use than C++ templates.
Some variation of this is needed because the LLVM backend underneath the
rustc compiler will crash the build if it is ever exposed to a SIMD
vector width value that it does not expect, which is basically anything
but a power of two. But there are ongoing discussions on whether
SupportedLaneCount
is the right way to do it. Therefore, be aware that this
part of the std::simd
API may change before stabilization.
An inlining puzzle
A major promise of C++, which was inherited by Rust, is that it should be possible to write reasonably straightforward high-level code, which compiles down to highly optimized machine code, by virtue of providing the compiler optimizer with all the information it needs to remove the high-level abstraction overhead at compile time.
Unfortunately, this promise comes with a footnote concerning function inlining:
- The process of turning high-level abstractions into efficient machine code vitally depends on a particular compiler optimization known as inline expansion or inlining. This optimization revolves around strategically copying and pasting the code of a function into its callers when it seems worthwhile, which enables many more optimizations down the line by letting the compiler specialize the function’s code for the context in which it is being called.
- Compiler optimizers decide whether to inline a function or not based on heuristics, and sometimes the heuristics is wrong and decides not to inline a function which should be inlined. This can result in an enormous runtime performance penalty.
In our case, as a quick run through a profiler will tell you, the performance problems that were observed at the end of the last chapter come from the fact that the compiler did not inline some iterator-related functions that are key to our code’s performance.
Available tools for inlining issues
When the function that is not inlined is a function that you wrote, there is
often an easy fix. Just annotate the function that fails to inline properly with
an
#[inline]
.
This will adjust the compiler optimizer’s cost model concerning this function,
and increase the odds that it does get inlined.
Most of the time, using #[inline]
will be enough to restore inlining where it
should happen, and bring back the runtime performance that you would expect. But
unfortunately, #[inline]
is just a hint, and the compiler’s optimizer may
occasionally refuse to take the hint and insist that in its highly educated
opinion, a function really should not be inlined.
For those difficult situations, Rust provides you with #[inline(always)]
,
which is a much stronger hint that a function should always be inlined, even
in debug builds. Basically, if it is at all possible to inline a function that
is annotated with #[inline(always)]
1, the compiler will inline it.
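To make this concrete, here is what these annotations look like in practice on a couple of hypothetical helper functions (the functions themselves are just examples):
// A regular hint: biases the optimizer's cost model towards inlining
#[inline]
fn weighted_diff(weight: Float, neighbour: Float, center: Float) -> Float {
    weight * (neighbour - center)
}
// A much stronger hint, honored whenever inlining is possible at all,
// even in debug builds
#[inline(always)]
fn stencil_weight(row: usize, col: usize) -> Float {
    STENCIL_WEIGHTS[row][col]
}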
Unfortunately, while all of this is useful, it does not address one important use case: what should you do when the function that fails to inline is not defined in your code, but in a different library like one of your dependencies or the standard library, as happens in our case?
When this happens, you basically have three options:
- Get in touch with the developers of that library and try to get them to add
#[inline]
directives in the right place. This works, but can take a while, and may fail if the authors of the library are not convinced that most users need inlining for runtime performance.2
- Tweak the code in the hope of getting it into a shape that inlines better.
- Roll your own version of the affected function(s), applying as many inlining directives as necessary to get the performance that you want.
For practical reasons3, this chapter will cover the last two options.
Locating the bad inlining
The inlining failures that are killing our performance stick out in the output of perf report.
There are two hot functions here that should be inlined into their caller, but are not:
- The next() method of our zipped iterator over window elements and stencil weights. Failing to inline this will increase the cost of each loop iteration by some amount, since iteration variables will now need to be pushed to the stack and popped back when they could stay resident in CPU registers instead. This is already quite bad in such a tight loop.
- The ndarray::zip::Zip::inner() method that performs the iteration over zipped iterators and does the actual computation is the bigger problem, however. Failing to inline this function into its multi-versioned caller breaks function multi-versioning, because the outlined version will not be multi-versioned. Therefore, it will only be compiled for the lowest common denominator x86_64 SIMD instruction set, SSE2, which will cost us a lot of SIMD performance.
The first issue is actually surprisingly easy to resolve, once you know the trick. Just replace the 1D-style flattened iterator that makes the compiler optimizer choke with a version that separates the iteration over rows and columns of data:
// We give up on this...
(win_u.into_iter())
.zip(win_v)
.zip(STENCIL_WEIGHTS.into_iter().flatten())
// ...and instead go for this:
(win_u.rows().into_iter())
.zip(win_v.rows())
.zip(STENCIL_WEIGHTS)
.flat_map(|((u_row, v_row), weights_row)| {
(u_row.into_iter().copied())
.zip(v_row.into_iter().copied())
.zip(weights_row)
})
As it turns out, the compiler has an easier time inlining through two layers of simple iteration than one layer of complex iteration, which is annoying but fair enough.
The second issue, however, is more difficult to resolve. ndarray’s producer-zipping code is quite complex because it tries to support arrays of arbitrary dimensionality, storage backend, and layout. Therefore, the amount of code that would need to be marked #[inline] would be quite large, and there is a relatively high risk that upstream would reject an associated pull request because it could have unforeseen side-effects on other code that calls ndarray::azip!().4
Hence we are going to go for a different approach, and instead roll our own
4-arrays-zipping iterator that is specialized for our needs… and marked
#[inline]
the way we want it.
Introducing unsafe
A key design goal of Rust is that most of the code that you write should be proven to be type-safe, memory-safe and thread-safe, either at compile time or via runtime checks. For concision reasons, code that achieves the combination of these three properties is referred to as “safe code”.
Sadly, Rust cannot extend this proof to 100% of your code for two important reasons:
- There are useful software operations whose correctness cannot be proven at compile-time. Think about, for example, calling into a third-party library that is implemented in C/++. Since code written in these languages is not proven to be safe at compile time and does not formally document its runtime safety invariants, there is no way for the Rust compiler to prove that such a function call is safe. Yet being able to interoperate with the existing ecosystem of C/++ code is a very important asset that most Rust programmers would not want to give up on.
- Runtime checks come at a runtime performance cost, unless they are eliminated by the compiler’s optimizer. Unfortunately, compiler optimizers are not very good at this job, and it is very common for them to leave around runtime safety checks in situations where a human could easily prove that they will never be triggered. The associated performance cost may not be acceptable to programs with strict runtime performance requirements.
As a compromise, Rust therefore provides access to supplementary operations of unproven safety, whose use is guarded by the unsafe keyword. They have various uses, but for our purposes, the operations of highest interest are those that replicate standard library constructs that involve safety checks, but without the safety checks. Instead of crashing the program at runtime on erroneous usage, these unsafe operations will trigger undefined behavior and let the compiler trash your program in unpredictable ways, like most C/++ operations do.
The intended use of these unsafe
operations is of course not to make Rust a
C/++-like undefined behavior free-for-all, turning all of the hard-earned
language safety guarantees into a fragile illusion that shatters on the first
use of the unsafe
keyword. Instead, unsafe
operations are meant to be used
inside of the implementation of safe operations, in order to do things that the
safe subset of the language cannot currently do due to compiler limitations,
like safely indexing into arrays at a known-good position without paying the
price of runtime bound checks.
As you may guess by now, the iterators of all basic collection types (arrays,
Vec
s…) are implemented using unsafe code. And so are the iterators of
ndarray
’s multidimensional arrays.
Of course, use of unsafe code is not without risks. The human proof backing the implementation of safe operations may be wrong and let the risk of Undefined Behavior (UB) slip through, which is why frivolous use of unsafe is highly frowned upon in the Rust community. In a nutshell, unsafe code authors must keep in mind that…
- unsafe code should only be used in situations where there is no safe way to perform the task with acceptable runtime performance. The Rust community’s tolerance for convoluted safe code that improves compiler optimizations through unnatural contortions is much higher than for unsafe code that could be safe.
- If a function is not marked as unsafe, there must be no combination of inputs that can lead to the emergence of undefined behavior when it is called.
- If a function is marked as unsafe, its documentation must clearly spell out the safety contract that must be followed to avoid undefined behavior, so that users of the unsafe function know how to use it correctly later on.
An optimized iterator
This is an optimized version of our lockstep iteration pattern, implemented using unsafe code:
use ndarray::{ArrayView2, ShapeBuilder};
/// Optimized iterator over stencil output locations and input windows
#[inline]
pub fn stencil_iter<'data, const SIMD_WIDTH: usize>(
start: &'data UV<SIMD_WIDTH>,
end: &'data mut UV<SIMD_WIDTH>,
) -> impl Iterator<
Item = (
ArrayView2<'data, Vector<SIMD_WIDTH>>, // <- Input u window
ArrayView2<'data, Vector<SIMD_WIDTH>>, // <- Input v window
&'data mut Vector<SIMD_WIDTH>, // <- Output u
&'data mut Vector<SIMD_WIDTH>, // <- Output v
),
>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
// Assert that the sub-grids all have the same memory layout.
// This means that what we learn about one is valid for the other.
//
// It is fine to have a bunch of runtime assertions in an iterator
// constructor, because their cost will be amortized across all iterations.
assert_eq!(start.u.shape(), start.v.shape());
assert_eq!(start.u.shape(), end.u.shape());
assert_eq!(start.u.shape(), end.v.shape());
assert_eq!(start.u.strides(), start.v.strides());
assert_eq!(start.u.strides(), end.u.strides());
assert_eq!(start.u.strides(), end.v.strides());
// Collect and check common layout information
let in_shape = start.simd_shape();
assert!(in_shape.into_iter().min().unwrap() >= 2);
let strides = start.u.strides();
assert_eq!(strides.len(), 2);
assert!(strides.iter().all(|stride| *stride > 0));
let [row_stride, col_stride] = [strides[0] as usize, strides[1] as usize];
assert_eq!(col_stride, 1);
// Select the center of the simulation domain
let out_shape = in_shape.map(|dim| dim - 2);
    let out_slice = s![1..=out_shape[0], 1..=out_shape[1]];
let mut out_u = end.u.slice_mut(out_slice);
let mut out_v = end.v.slice_mut(out_slice);
assert_eq!(start.u.strides(), out_u.strides());
assert_eq!(start.u.strides(), out_v.strides());
let [out_rows, out_cols] = out_shape;
// Determine how many elements we must skip in order to go from the
// past-the-end element of one row to the first element of the next row.
let next_row_step = row_stride - out_cols;
// Prepare a way to access input windows and output refs by output position
// The safety of the closures below is actually asserted on the caller's
// side, but sadly unsafe closures aren't a thing in Rust yet.
let stencil_shape = [STENCIL_WEIGHTS.len(), STENCIL_WEIGHTS[0].len()];
let window_shape = (stencil_shape[0], stencil_shape[1]).strides((row_stride, 1));
let unchecked_output = move |out_ptr: *mut Vector<SIMD_WIDTH>| unsafe { &mut *out_ptr };
let unchecked_input_window = move |in_ptr: *const Vector<SIMD_WIDTH>| unsafe {
ArrayView2::from_shape_ptr(window_shape, in_ptr)
};
// Recipe to emit the currently selected input windows and output references,
// then move to the next column. As before, this is only safe if called with
// correct element pointers.
let emit_and_increment =
move |in_u_ptr: &mut *const Vector<SIMD_WIDTH>,
in_v_ptr: &mut *const Vector<SIMD_WIDTH>,
out_u_ptr: &mut *mut Vector<SIMD_WIDTH>,
out_v_ptr: &mut *mut Vector<SIMD_WIDTH>| unsafe {
let win_u = unchecked_input_window(*in_u_ptr);
let win_v = unchecked_input_window(*in_v_ptr);
let out_u = unchecked_output(*out_u_ptr);
let out_v = unchecked_output(*out_v_ptr);
*in_u_ptr = in_u_ptr.add(1);
*in_v_ptr = in_v_ptr.add(1);
*out_u_ptr = out_u_ptr.add(1);
*out_v_ptr = out_v_ptr.add(1);
(win_u, win_v, out_u, out_v)
};
// Set up iteration state
let mut in_u_ptr = start.u.as_ptr();
let mut in_v_ptr = start.v.as_ptr();
let mut out_u_ptr = out_u.as_mut_ptr();
let mut out_v_ptr = out_v.as_mut_ptr();
//
// End of the current row processed by out_v_ptr
let mut out_v_row_end = unsafe { out_v_ptr.add(out_cols) };
//
// End of the last row of the output grid
let out_v_end = unsafe { out_v_row_end.add(out_rows.saturating_sub(1) * row_stride) };
// Emit output iterator
std::iter::from_fn(move || {
// Common case : we are within the bounds of a row and advance normally
if out_v_ptr < out_v_row_end {
return Some(emit_and_increment(
&mut in_u_ptr,
&mut in_v_ptr,
&mut out_u_ptr,
&mut out_v_ptr,
));
}
// Otherwise, check if we reached the end of iteration
if out_v_ptr == out_v_end {
return None;
}
// We're at the end of a row, but not at the end of iteration:
// switch to the next row then emit the next element as usual
debug_assert_eq!(out_v_ptr, out_v_row_end);
unsafe {
in_u_ptr = in_u_ptr.add(next_row_step);
in_v_ptr = in_v_ptr.add(next_row_step);
out_u_ptr = out_u_ptr.add(next_row_step);
out_v_ptr = out_v_ptr.add(next_row_step);
out_v_row_end = out_v_ptr.add(out_cols);
}
Some(emit_and_increment(
&mut in_u_ptr,
&mut in_v_ptr,
&mut out_u_ptr,
&mut out_v_ptr,
))
})
}
We do not have the time to cover how it works in detail, but in a nutshell, it
is the same code that the iterator zip in our optimized SIMD implementation
should compile down to, and unlike the iterator zip, we wrote it and therefore
can put a hard-earned #[inline]
directive on it.
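For reference, here is roughly what the call site looks like; this is only a sketch, with the surrounding variable names assumed from the earlier SIMD version of the update loop:
// Hypothetical use of the hand-rolled iterator inside update():
for (win_u, win_v, out_u, out_v) in stencil_iter::<SIMD_WIDTH>(start, end) {
    // ... compute the diffusion gradient from win_u and win_v, then write
    // the updated concentrations to *out_u and *out_v as in previous chapters ...
}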
Exercise
Integrate these two iterator inlining optimizations into your code, and measure their effect on runtime performance. It should now be more in line (heh heh) with what you would expect considering the work that was put into SIMD layout improvements in the last chapter.
There is a lesson to be learned here: when an optimization does not have the payoff that you would expect, do not conclude that it is bad right away. Instead, take the time to figure out what’s going on, and whether your optimization is truly working as intended.
It is not always possible to inline function calls due to annoying edge cases like recursion.
This is why language and compiler authors should really get their act
together and complement function-level inlining directives with more
flexible call site inlining directives. But to the author’s knowledge,
only clang
has provided a basic matching C extension to this day. Here
are some possible reasons why:
- Compiler optimizer codebases tend to be very messy and understaffed, and extending them with a new optimization hint can take an unexpected amount of refactoring work.
- At the programming language level, designing good syntax for annotating individual function calls is not as easy as it seems, because modern programming languages feature many constructs that call functions but do not look like function calls, including any kind of operator overloading. And there are other interesting programming language design questions concerning how you would hint about transitive inlining beyond one directly annotated function call.
We do not want you to accidentally DDoS some poor open source maintainers with hundreds of issues and pull requests and wait for them to sort through the duplicate reports during the entire school.
There is a reason why sane compilers do not inline all function calls by default. Inlining means duplicating compilation work, which will have consequences in terms of compilation time and RAM consumption. It increases code size, which can cause I-cache problems at runtime. And by making caller functions too large, it can trigger other optimizer heuristics that tune down the amount of optimizations that is performed on the caller function. Basically, inlining is a tradeoff: it is very useful in the right place, but it can easily do more harm than good in the wrong place. Which is why it is very annoying that most compilers and programming languages only provide callee-side inlining hints at this point in time.
FMA vs ILP
If you paid close attention during the first part of this course, you may have been thinking for a while now that there is some code in the Gray-Scott computation that looks like a perfect candidate for introducing fused multiply-add CPU instructions.
More specifically, if we look at an iteration of the update loop…
// Access the SIMD data corresponding to the center concentration
let u = win_u[STENCIL_OFFSET];
let v = win_v[STENCIL_OFFSET];
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.rows().into_iter())
.zip(win_v.rows())
.zip(STENCIL_WEIGHTS)
.flat_map(|((u_row, v_row), weights_row)| {
(u_row.into_iter().copied())
.zip(v_row.into_iter().copied())
.zip(weights_row)
})
.fold(
[Vector::splat(0.); 2],
|[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
let weight = Vector::splat(weight);
[
acc_u + weight * (stencil_u - u),
acc_v + weight * (stencil_v - v),
]
},
);
// Compute SIMD versions of all the float constants that we use
let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
let feedrate = Vector::splat(opts.feedrate);
let killrate = Vector::splat(opts.killrate);
let deltat = Vector::splat(opts.deltat);
let ones = Vector::splat(1.0);
// Compute the output values of u and v
let uv_square = u * v * v;
let du = diffusion_rate_u * full_u - uv_square + feedrate * (ones - u);
let dv = diffusion_rate_v * full_v + uv_square - (feedrate + killrate) * v;
*out_u = u + du * deltat;
*out_v = v + dv * deltat;
…we can see that the diffusion gradient’s fold()
statement and the
computation of the output values of u
and v
are full of floating point
multiplications followed by additions.
It would surely seem sensible to replace these with fused multiply-add operations that compute multiplication-addition pairs 2x faster and more accurately at the same time!
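As a reminder, a fused multiply-add computes a * b + c in a single operation, with a single rounding step at the end. In std::simd it is exposed via the StdFloat trait; the sketch below uses f32x8 as an arbitrary example vector type:
use std::simd::{f32x8, StdFloat};

// a.mul_add(b, c) evaluates a * b + c as one fused operation,
// rounding only once at the end.
let (a, b, c) = (f32x8::splat(2.0), f32x8::splat(3.0), f32x8::splat(1.0));
assert_eq!(a.mul_add(b, c), f32x8::splat(7.0));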
Free lunch at last?
Enticed by the prospect of getting an easy 2x performance speedup at last, we might proceed to rewrite all the code using fused multiply-add operations…
// We must bring this trait in scope in order to use mul_add()
use std::simd::StdFloat;
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.rows().into_iter())
.zip(win_v.rows())
.zip(STENCIL_WEIGHTS)
.flat_map(|((u_row, v_row), weights_row)| {
(u_row.into_iter().copied())
.zip(v_row.into_iter().copied())
.zip(weights_row)
})
.fold(
[Vector::splat(0.); 2],
|[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
let weight = Vector::splat(weight);
[
// NEW: Introduced FMA here
(stencil_u - u).mul_add(weight, acc_u),
(stencil_v - v).mul_add(weight, acc_v),
]
},
);
// Compute SIMD versions of all the float constants that we use
let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
let feedrate = Vector::splat(opts.feedrate);
let killrate = Vector::splat(opts.killrate);
let deltat = Vector::splat(opts.deltat);
let ones = Vector::splat(1.0);
// Compute the output values of u and v
// NEW: Introduced even more FMA there
let uv_square = u * v * v;
let du = diffusion_rate_u.mul_add(full_u, (ones - u).mul_add(feedrate, -uv_square));
let dv = diffusion_rate_v.mul_add(full_v, (feedrate + killrate).mul_add(-v, uv_square));
*out_u = du.mul_add(deltat, u);
*out_v = dv.mul_add(deltat, v);
…and then run our microbenchmarks, full of hope…
simulate/16x16 time: [8.3555 µs 8.3567 µs 8.3580 µs]
thrpt: [30.629 Melem/s 30.634 Melem/s 30.638 Melem/s]
change:
time: [+731.23% +731.59% +731.89%] (p = 0.00 < 0.05)
thrpt: [-87.979% -87.975% -87.970%]
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
6 (6.00%) high mild
3 (3.00%) high severe
simulate/64x64 time: [68.137 µs 68.167 µs 68.212 µs]
thrpt: [60.048 Melem/s 60.088 Melem/s 60.114 Melem/s]
change:
time: [+419.07% +419.96% +421.09%] (p = 0.00 < 0.05)
thrpt: [-80.809% -80.768% -80.735%]
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
Benchmarking simulate/256x256: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
simulate/256x256 time: [891.62 µs 891.76 µs 891.93 µs]
thrpt: [73.476 Melem/s 73.491 Melem/s 73.502 Melem/s]
change:
time: [+327.65% +329.35% +332.59%] (p = 0.00 < 0.05)
thrpt: [-76.883% -76.709% -76.616%]
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
simulate/1024x1024 time: [10.769 ms 11.122 ms 11.489 ms]
thrpt: [91.266 Melem/s 94.276 Melem/s 97.370 Melem/s]
change:
time: [+34.918% +39.345% +44.512%] (p = 0.00 < 0.05)
thrpt: [-30.802% -28.236% -25.881%]
Performance has regressed.
Benchmarking simulate/4096x4096: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.9s, or reduce sample count to 70.
simulate/4096x4096 time: [71.169 ms 71.273 ms 71.376 ms]
thrpt: [235.05 Melem/s 235.39 Melem/s 235.74 Melem/s]
change:
time: [+0.1618% +0.4000% +0.6251%] (p = 0.00 < 0.05)
thrpt: [-0.6213% -0.3984% -0.1616%]
Change within noise threshold.
…but ultimately, we get nothing but disappointment and perplexity. What’s going on, and why does this attempt at optimization slow everything down to a crawl instead of speeding things up?
The devil in the details
Software performance optimization is in many ways like natural sciences: when experiment stubbornly refuses to agree with our theory, it is good to review our theoretical assumptions.
When we say that modern x86 hardware can compute an FMA for the same cost as a multiplication or an addition, what we actually mean is that hardware can compute two FMAs per cycle, much like it can compute two additions per cycle1 and two multiplications per cycle.
However, that statement comes with an “if” attached: our CPUs can only compute two FMAs per cycle if there is sufficient instruction-level parallelism in the code to keep the FMA units busy.
And it does not end there; those “if”s just keep piling up:
- FMAs have a higher latency than additions. Therefore, it takes more instruction-level parallelism to hide that latency by executing unrelated work while waiting for the results to come out. If you happen to be short on instruction-level parallelism, adding more FMAs will quickly make your program latency-bound, and thus slower.
- SIMD multiplication, addition and subtraction are quite flexible in terms of which registers or memory locations inputs can come from, which registers outputs can go to, and when negation can be applied. In contrast, because they have three operands, FMAs suffer from combinatorial explosion and end up with fewer supported patterns. It is therefore easier to end up in a situation where a single hardware FMA instruction cannot do the trick and you need more CPU instructions to do what looks like a single multiply + add/sub pattern in the code.
All this is to say, the hardware FMA implementations that we have today have mainly been designed to make CPUs score higher at LINPACK benchmarking contests, and achieved this goal with flying colors. As an unexpected bonus, it turns out that they may also prove useful when implementing several other very regular linear algebra and signal processing routines that do nothing but performing lots of independent FMAs in rapid succession. But for any kind of computation that exhibits a less trivial and regular pattern of floating point (add, mul) pairs, it may take serious work from your side to achieve the promised 2x speedup from the use of FMA instructions.
Take two
Keeping the above considerations in mind, we will start by scaling back our FMA ambitions, and only using FMAs in the initial part of the code that looks like a dot product.
The rationale for this is that the rest of the code is largely latency-bound, and irregular enough to hit FMA implementation corner cases. Therefore, hardware FMA is unlikely to help there.
// Compute the diffusion gradient for U and V
let [full_u, full_v] = (win_u.rows().into_iter())
.zip(win_v.rows())
.zip(STENCIL_WEIGHTS)
.flat_map(|((u_row, v_row), weights_row)| {
(u_row.into_iter().copied())
.zip(v_row.into_iter().copied())
.zip(weights_row)
})
.fold(
[Vector::splat(0.); 2],
|[acc_u, acc_v], ((stencil_u, stencil_v), weight)| {
let weight = Vector::splat(weight);
[
// NEW: We keep the FMAs here...
(stencil_u - u).mul_add(weight, acc_u),
(stencil_v - v).mul_add(weight, acc_v),
]
},
);
// Compute SIMD versions of all the float constants that we use
let diffusion_rate_u = Vector::splat(DIFFUSION_RATE_U);
let diffusion_rate_v = Vector::splat(DIFFUSION_RATE_V);
let feedrate = Vector::splat(opts.feedrate);
let killrate = Vector::splat(opts.killrate);
let deltat = Vector::splat(opts.deltat);
let ones = Vector::splat(1.0);
// Compute the output values of u and v
// NEW: ...but we roll back the introduction of FMA there.
let uv_square = u * v * v;
let du = diffusion_rate_u * full_u - uv_square + feedrate * (ones - u);
let dv = diffusion_rate_v * full_v + uv_square - (feedrate + killrate) * v;
*out_u = u + du * deltat;
*out_v = v + dv * deltat;
If we benchmark this new implementation against the FMA-less implementation, we get results that look a lot more reasonable already when compared to the previous chapter’s version:
simulate/16x16 time: [1.0638 µs 1.0640 µs 1.0642 µs]
thrpt: [240.55 Melem/s 240.60 Melem/s 240.65 Melem/s]
change:
time: [+5.5246% +5.5631% +5.5999%] (p = 0.00 < 0.05)
thrpt: [-5.3029% -5.2699% -5.2354%]
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
simulate/64x64 time: [13.766 µs 13.769 µs 13.771 µs]
thrpt: [297.43 Melem/s 297.48 Melem/s 297.54 Melem/s]
change:
time: [+6.7803% +6.8418% +6.9153%] (p = 0.00 < 0.05)
thrpt: [-6.4680% -6.4037% -6.3498%]
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low mild
1 (1.00%) high mild
3 (3.00%) high severe
simulate/256x256 time: [218.72 µs 218.75 µs 218.78 µs]
thrpt: [299.55 Melem/s 299.60 Melem/s 299.63 Melem/s]
change:
time: [+5.9635% +5.9902% +6.0154%] (p = 0.00 < 0.05)
thrpt: [-5.6741% -5.6516% -5.6278%]
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
simulate/1024x1024 time: [7.8361 ms 7.9408 ms 8.0456 ms]
thrpt: [130.33 Melem/s 132.05 Melem/s 133.81 Melem/s]
change:
time: [-0.3035% +1.6779% +3.6310%] (p = 0.09 > 0.05)
thrpt: [-3.5037% -1.6502% +0.3044%]
No change in performance detected.
Benchmarking simulate/4096x4096: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.9s, or reduce sample count to 70.
simulate/4096x4096 time: [70.898 ms 70.994 ms 71.088 ms]
thrpt: [236.01 Melem/s 236.32 Melem/s 236.64 Melem/s]
change:
time: [+2.5330% +2.7397% +2.9421%] (p = 0.00 < 0.05)
thrpt: [-2.8580% -2.6667% -2.4704%]
Performance has regressed.
That’s not a speedup yet, but at least it is not a massive slowdown anymore.
Breaking the (dependency) chain
The reason why our diffusion gradient computation did not get faster with FMAs is that it features two long addition dependency chains, one for acc_u values and one for acc_v values.
At each step of the original fold() reduction loop, if we consider the computation of acc_u…
- There’s one stencil_u - u that can be computed as soon as the input stencil_u is available.2
- Then one multiplication by weight, which can only be done once that subtraction is done.
- And there is one final accumulation, which will need to wait until the result of the multiplication is available AND the result of the previous acc_u computation is ready.
…and acc_v plays a perfectly symmetrical role. But after introducing FMA, we get this:
- As before, there’s one stencil_u - u that can be computed as soon as input is available.
- And then there is one FMA operation that needs both the result of that subtraction and the result of the previous acc_u computation.
Overall, we execute fewer CPU instructions per loop iteration. But we also lengthened the dependency chain for acc_u, because the FMA that computes it has a higher latency. Ultimately, the two effects almost cancel each other out, and we get performance that is close, but slightly worse.
Can we break this acc_u
dependency chain and speed things up by introducing
extra instruction-level parallelism, as we did before when summing
floating-point numbers? We sure can try!
However, it is important to realize that we must do it with care, because introducing more ILP increases our code’s SIMD register consumption, and we’re already putting the 16 available x86 SIMD registers under quite a bit of pressure.
Indeed, in an ideal world, any quantity that remains constant across update loop iterations would remain resident in a CPU register. But in our current code, this includes…
- All the useful stencil element values (which, after compiler optimizer deduplication and removal of zero computations, yield one register full of copies of the 0.5 constant and another full of copies of the 0.25 constant).
- Splatted versions of all 5 simulation parameters diffusion_rate_u, diffusion_rate_v, feedrate, killrate and deltat.
- One register full of 1.0s, matching our variable ones.
- And most likely one register full of 0.0, because one of these almost always comes up in SIMD computations. To the point where many non-x86 CPUs optimize for it by providing a fake register from which reads always return 0.
This means that in the ideal world where all constants can be kept resident in registers, we already have 9 SIMD registers eaten up by constants, and only have 7 registers left for the computation proper. In addition, for each loop iteration, we also need at a bare minimum…
- 2 registers for holding the center values of u and v.
- 2 registers for successively holding stencil_u, stencil_u - u and (stencil_u - u) * weight before accumulation, along with the v equivalent.
- 2 registers for holding the current values of acc_u and acc_v.
So all in all, we only have 1 CPU register left for nontrivial purposes before we start spilling our constants back to memory. Which means that introducing significant ILP (in the form of two new accumulators and input values) will necessarily come at the expense of spilling constants to memory and needing to reload them from memory later on. The question then becomes: are the benefits of ILP worth this expense?
Experiment time
Well, there is only one way to find out. First we try to introduce two-way ILP3…
// Compute SIMD versions of the stencil weights
let stencil_weights = STENCIL_WEIGHTS.map(|row| row.map(Vector::splat));
// Compute the diffusion gradient for U and V
let mut full_u_1 = (win_u[[0, 0]] - u) * stencil_weights[0][0];
let mut full_v_1 = (win_v[[0, 0]] - v) * stencil_weights[0][0];
let mut full_u_2 = (win_u[[0, 1]] - u) * stencil_weights[0][1];
let mut full_v_2 = (win_v[[0, 1]] - v) * stencil_weights[0][1];
full_u_1 = (win_u[[0, 2]] - u).mul_add(stencil_weights[0][2], full_u_1);
full_v_1 = (win_v[[0, 2]] - v).mul_add(stencil_weights[0][2], full_v_1);
full_u_2 = (win_u[[1, 0]] - u).mul_add(stencil_weights[1][0], full_u_2);
full_v_2 = (win_v[[1, 0]] - v).mul_add(stencil_weights[1][0], full_v_2);
assert_eq!(STENCIL_WEIGHTS[1][1], 0.0); // <- Will be optimized out
full_u_1 = (win_u[[1, 2]] - u).mul_add(stencil_weights[1][2], full_u_1);
full_v_1 = (win_v[[1, 2]] - v).mul_add(stencil_weights[1][2], full_v_1);
full_u_2 = (win_u[[2, 0]] - u).mul_add(stencil_weights[2][0], full_u_2);
full_v_2 = (win_v[[2, 0]] - v).mul_add(stencil_weights[2][0], full_v_2);
full_u_1 = (win_u[[2, 1]] - u).mul_add(stencil_weights[2][1], full_u_1);
full_v_1 = (win_v[[2, 1]] - v).mul_add(stencil_weights[2][1], full_v_1);
full_u_2 = (win_u[[2, 2]] - u).mul_add(stencil_weights[2][2], full_u_2);
full_v_2 = (win_v[[2, 2]] - v).mul_add(stencil_weights[2][2], full_v_2);
let full_u = full_u_1 + full_u_2;
let full_v = full_v_1 + full_v_2;
…and we observe that it is indeed quite a significant benefit for this computation:
simulate/16x16 time: [452.35 ns 452.69 ns 453.13 ns]
thrpt: [564.96 Melem/s 565.51 Melem/s 565.94 Melem/s]
change:
time: [-57.479% -57.442% -57.399%] (p = 0.00 < 0.05)
thrpt: [+134.73% +134.97% +135.18%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) high mild
7 (7.00%) high severe
simulate/64x64 time: [4.0774 µs 4.0832 µs 4.0952 µs]
thrpt: [1.0002 Gelem/s 1.0031 Gelem/s 1.0046 Gelem/s]
change:
time: [-70.417% -70.388% -70.352%] (p = 0.00 < 0.05)
thrpt: [+237.29% +237.70% +238.03%]
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
simulate/256x256 time: [59.714 µs 59.746 µs 59.791 µs]
thrpt: [1.0961 Gelem/s 1.0969 Gelem/s 1.0975 Gelem/s]
change:
time: [-72.713% -72.703% -72.691%] (p = 0.00 < 0.05)
thrpt: [+266.18% +266.35% +266.47%]
Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
4 (4.00%) low mild
3 (3.00%) high mild
9 (9.00%) high severe
simulate/1024x1024 time: [7.0386 ms 7.1614 ms 7.2842 ms]
thrpt: [143.95 Melem/s 146.42 Melem/s 148.98 Melem/s]
change:
time: [-11.745% -9.8149% -7.9159%] (p = 0.00 < 0.05)
thrpt: [+8.5963% +10.883% +13.308%]
Performance has improved.
simulate/4096x4096 time: [37.029 ms 37.125 ms 37.219 ms]
thrpt: [450.78 Melem/s 451.91 Melem/s 453.08 Melem/s]
change:
time: [-47.861% -47.707% -47.557%] (p = 0.00 < 0.05)
thrpt: [+90.684% +91.229% +91.797%]
Performance has improved.
Encouraged by this first success, we try to go for 4-way ILP…
let mut full_u_1 = (win_u[[0, 0]] - u) * stencil_weights[0][0];
let mut full_v_1 = (win_v[[0, 0]] - v) * stencil_weights[0][0];
let mut full_u_2 = (win_u[[0, 1]] - u) * stencil_weights[0][1];
let mut full_v_2 = (win_v[[0, 1]] - v) * stencil_weights[0][1];
let mut full_u_3 = (win_u[[0, 2]] - u) * stencil_weights[0][2];
let mut full_v_3 = (win_v[[0, 2]] - v) * stencil_weights[0][2];
let mut full_u_4 = (win_u[[1, 0]] - u) * stencil_weights[1][0];
let mut full_v_4 = (win_v[[1, 0]] - v) * stencil_weights[1][0];
assert_eq!(STENCIL_WEIGHTS[1][1], 0.0);
full_u_1 = (win_u[[1, 2]] - u).mul_add(stencil_weights[1][2], full_u_1);
full_v_1 = (win_v[[1, 2]] - v).mul_add(stencil_weights[1][2], full_v_1);
full_u_2 = (win_u[[2, 0]] - u).mul_add(stencil_weights[2][0], full_u_2);
full_v_2 = (win_v[[2, 0]] - v).mul_add(stencil_weights[2][0], full_v_2);
full_u_3 = (win_u[[2, 1]] - u).mul_add(stencil_weights[2][1], full_u_3);
full_v_3 = (win_v[[2, 1]] - v).mul_add(stencil_weights[2][1], full_v_3);
full_u_4 = (win_u[[2, 2]] - u).mul_add(stencil_weights[2][2], full_u_4);
full_v_4 = (win_v[[2, 2]] - v).mul_add(stencil_weights[2][2], full_v_4);
let full_u = (full_u_1 + full_u_2) + (full_u_3 + full_u_4);
let full_v = (full_v_1 + full_v_2) + (full_v_3 + full_v_4);
…and it does not change much anymore on the Intel i9-10900 CPU that I’m testing on here, although I can tell you that it still gives ~10% speedups on my AMD Zen 3 CPUs at home.4
simulate/16x16 time: [450.29 ns 450.40 ns 450.58 ns]
thrpt: [568.16 Melem/s 568.38 Melem/s 568.52 Melem/s]
change:
time: [-0.6248% -0.4375% -0.2013%] (p = 0.00 < 0.05)
thrpt: [+0.2017% +0.4394% +0.6287%]
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
simulate/64x64 time: [4.0228 µs 4.0230 µs 4.0231 µs]
thrpt: [1.0181 Gelem/s 1.0182 Gelem/s 1.0182 Gelem/s]
change:
time: [-1.4861% -1.3761% -1.3101%] (p = 0.00 < 0.05)
thrpt: [+1.3275% +1.3953% +1.5085%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) low severe
4 (4.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
simulate/256x256 time: [60.567 µs 60.574 µs 60.586 µs]
thrpt: [1.0817 Gelem/s 1.0819 Gelem/s 1.0820 Gelem/s]
change:
time: [+1.3650% +1.4165% +1.4534%] (p = 0.00 < 0.05)
thrpt: [-1.4326% -1.3967% -1.3467%]
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) low severe
1 (1.00%) low mild
3 (3.00%) high mild
10 (10.00%) high severe
simulate/1024x1024 time: [6.8775 ms 6.9998 ms 7.1208 ms]
thrpt: [147.26 Melem/s 149.80 Melem/s 152.47 Melem/s]
change:
time: [-4.5709% -2.2561% +0.0922%] (p = 0.07 > 0.05)
thrpt: [-0.0921% +2.3082% +4.7898%]
No change in performance detected.
simulate/4096x4096 time: [36.398 ms 36.517 ms 36.641 ms]
thrpt: [457.88 Melem/s 459.44 Melem/s 460.94 Melem/s]
change:
time: [-2.0551% -1.6388% -1.2452%] (p = 0.00 < 0.05)
thrpt: [+1.2609% +1.6661% +2.0982%]
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Finally, we can try to implement maximal 8-way ILP…
let full_u_1 = (win_u[[0, 0]] - u) * stencil_weights[0][0];
let full_v_1 = (win_v[[0, 0]] - v) * stencil_weights[0][0];
let full_u_2 = (win_u[[0, 1]] - u) * stencil_weights[0][1];
let full_v_2 = (win_v[[0, 1]] - v) * stencil_weights[0][1];
let full_u_3 = (win_u[[0, 2]] - u) * stencil_weights[0][2];
let full_v_3 = (win_v[[0, 2]] - v) * stencil_weights[0][2];
let full_u_4 = (win_u[[1, 0]] - u) * stencil_weights[1][0];
let full_v_4 = (win_v[[1, 0]] - v) * stencil_weights[1][0];
assert_eq!(STENCIL_WEIGHTS[1][1], 0.0);
let full_u_5 = (win_u[[1, 2]] - u) * stencil_weights[1][2];
let full_v_5 = (win_v[[1, 2]] - v) * stencil_weights[1][2];
let full_u_6 = (win_u[[2, 0]] - u) * stencil_weights[2][0];
let full_v_6 = (win_v[[2, 0]] - v) * stencil_weights[2][0];
let full_u_7 = (win_u[[2, 1]] - u) * stencil_weights[2][1];
let full_v_7 = (win_v[[2, 1]] - v) * stencil_weights[2][1];
let full_u_8 = (win_u[[2, 2]] - u) * stencil_weights[2][2];
let full_v_8 = (win_v[[2, 2]] - v) * stencil_weights[2][2];
let full_u = ((full_u_1 + full_u_2) + (full_u_3 + full_u_4))
+ ((full_u_5 + full_u_6) + (full_u_7 + full_u_8));
let full_v = ((full_v_1 + full_v_2) + (full_v_3 + full_v_4))
+ ((full_v_5 + full_v_6) + (full_v_7 + full_v_8));
…and it is indisputably slower:
simulate/16x16 time: [486.03 ns 486.09 ns 486.16 ns]
thrpt: [526.58 Melem/s 526.65 Melem/s 526.72 Melem/s]
change:
time: [+7.6024% +7.8488% +7.9938%] (p = 0.00 < 0.05)
thrpt: [-7.4021% -7.2776% -7.0653%]
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
16 (16.00%) high mild
2 (2.00%) high severe
simulate/64x64 time: [4.6472 µs 4.6493 µs 4.6519 µs]
thrpt: [880.51 Melem/s 881.00 Melem/s 881.38 Melem/s]
change:
time: [+15.496% +15.546% +15.598%] (p = 0.00 < 0.05)
thrpt: [-13.494% -13.454% -13.417%]
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) high mild
5 (5.00%) high severe
simulate/256x256 time: [68.774 µs 68.923 µs 69.098 µs]
thrpt: [948.45 Melem/s 950.86 Melem/s 952.91 Melem/s]
change:
time: [+13.449% +13.563% +13.710%] (p = 0.00 < 0.05)
thrpt: [-12.057% -11.943% -11.854%]
Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
1 (1.00%) low mild
4 (4.00%) high mild
16 (16.00%) high severe
simulate/1024x1024 time: [7.0141 ms 7.1438 ms 7.2741 ms]
thrpt: [144.15 Melem/s 146.78 Melem/s 149.50 Melem/s]
change:
time: [-0.2910% +2.0563% +4.6567%] (p = 0.12 > 0.05)
thrpt: [-4.4495% -2.0149% +0.2918%]
No change in performance detected.
simulate/4096x4096 time: [38.128 ms 38.543 ms 38.981 ms]
thrpt: [430.39 Melem/s 435.29 Melem/s 440.03 Melem/s]
change:
time: [+4.2543% +5.5486% +6.7916%] (p = 0.00 < 0.05)
thrpt: [-6.3597% -5.2569% -4.0807%]
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
This should not surprise you: at this point, we have fully lost the performance benefits of fused multiply-add, and we use so many registers for our inputs and accumulators that the compiler will need to generate a great number of constant spills and reloads in order to fit the computation into the 16 architectural x86 SIMD registers.
All in all, for this computation, the 4-way ILP version can be declared our performance winner.
Exercise
The astute reader will have noticed that we cannot compare between FMA and multiply-add sequences yet because we have only implemented ILP optimizations in the FMA-based code, not the original code based on non fused multiply-add sequences.
Please assess the true performance benefits of FMAs on this computation by
starting from the 4-way ILP FMA version and replacing all mul_add
s with
multiply-add sequences, then comparing the benchmark results in both cases.
Find the results interesting? We will get back to them in the next chapter.
Yes, I know about those newer Intel CPU cores that have a third adder. Thanks Intel, that was a great move! Now wake me up when these chips are anywhere near widespread in computing centers.
The astute reader will have noticed that we can actually get rid of this subtraction by changing the center weight of the stencil. As this optimization does not involve any new Rust or HPC concept, it is not super interesting in the context of this course, so in the interest of time we leave it as an exercise to the reader. But if you’re feeling lazy and prefer to read code than write it, it has been applied to the reference implementation.
Must use the traditional duplicated code ILP style here because
iterator_ilp
cannot implement the optimization of ignoring the zero
stencil value at the center, which is critical to performance here.
This is not surprising because AMD Zen CPUs have more independent floating-point ALUs than most Intel CPUs, and thus it takes more instruction-level parallelism to saturate their execution backend.
Cache blocking
Warning: Although the optimizations described in this section were useful on older Intel CPUs, they do not seem to be beneficial anymore since the Skylake generation, likely because the L2 cache of Intel CPUs got smart enough to handle things on its own in this relatively simple simulation.
However, in more complex computations, the need for this sort of optimization keeps coming up regularly even on newer CPUs. Hence we keep the chapter around so that you can still get a feeling of how it’s done on a simpler codebase.
If we run our latest version through perf stat -d
, we will see that above a
certain problem size, it seems to be slowed down by memory access issues:
$ perf stat -d -- cargo bench --bench simulate -- --profile-time=10 1024x
Finished `bench` profile [optimized + debuginfo] target(s) in 0.05s
Running benches/simulate.rs (target/release/deps/simulate-f304f2306d63383e)
Gnuplot not found, using plotters backend
Benchmarking simulate/1024x1024: Complete (Analysis Disabled)
Performance counter stats for 'cargo bench --bench simulate -- --profile-time=10 1024x':
15 457,63 msec task-clock # 1,000 CPUs utilized
45 context-switches # 2,911 /sec
0 cpu-migrations # 0,000 /sec
16 173 page-faults # 1,046 K/sec
66 043 605 917 cycles # 4,273 GHz (50,05%)
32 277 621 790 instructions # 0,49 insn per cycle (62,55%)
1 590 473 709 branches # 102,892 M/sec (62,56%)
4 078 658 branch-misses # 0,26% of all branches (62,52%)
6 694 606 694 L1-dcache-loads # 433,094 M/sec (62,50%)
1 395 491 611 L1-dcache-load-misses # 20,85% of all L1-dcache accesses (62,46%)
197 653 531 LLC-loads # 12,787 M/sec (49,97%)
2 226 645 LLC-load-misses # 1,13% of all LL-cache accesses (50,01%)
15,458909585 seconds time elapsed
15,411864000 seconds user
0,047092000 seconds sys
Indeed, 20% of our data accesses miss the L1 data cache, and possibly as a result, the CPU only executes one SIMD instruction every two clock cycles. Can we improve upon this somehow?
Visualizing the problem
Recall that our memory access pattern during iteration looks like the following schematic, where the outer blue-colored square represents SIMD vectors that are being read and the inner red-colored square represents SIMD vectors that are being written:
There is, however, a more hardware-centered way to study this pattern, which is to investigate which SIMD vectors must be brought into the L1 CPU cache at each step of the computation, and which SIMD vectors can be reused from previous steps.
For simplicity, we will draw our schematics as if cache lines did not exist and CPUs truly loaded or stored one single SIMD vector whenever we ask them to do so.
In the case of input data, the pattern looks like this, where pink represents data that is newly brought into the CPU cache, and green represents data that can be reused from previous iterations:
And in the case of output data, the pattern looks like this:
So far so good. As long as we are generating the first row of output data, there is no way to improve upon this pattern. The CPU cache is working as intended, speeding up at least 2/3 of our memory accesses, and it actually helps even more than this when more advanced cache features like hardware prefetching are taken into account.
The question is, what will happen when we reach the second row of output data? Can we reuse the data that we loaded when processing the first row of data, like this?
Or do we need to load everything again from RAM or a lower-level cache, like this?
The answer to this question actually depends on how many columns our simulation domain has:
- If the simulation domain has few columns, then by the time we start processing the second row of output, the input data will still be fresh in the L1 CPU cache and can efficiently be reused.
- But if the simulation domain has many columns, by the time we get to the second row, the input data from the beginning of the first row will have been silently dropped by the L1 CPU cache, and we will need to slowly reload it from RAM or a lower-level CPU cache.
We can estimate where the limit lies using relatively simple math:
- Let us denote S the size of the L1 CPU cache in bytes and C the width of the simulation domain in scalar columns.
- In our optimized SIMD data layout, rows of data are padded by 1 SIMD vector of zeros on both sides, so we actually have K = C + 2 * W scalar data columns in our internal tables, where W is our SIMD vector width.
- To produce a row of output, we must read 3 * K data points from each of the two input tables representing the starting concentrations of species U and V, and we must write K - 2 * W = C data points to each of the two matching output tables.
- Each data point is a number of type Float, which is currently configured to be of type f32. Therefore, a Float is currently 4 bytes.
- So overall, the CPU cache footprint that is associated with reading input for an entire row is 2 * (3 * K) * 4 = 24 * K bytes, and the CPU cache footprint that is associated with writing output for an entire row is 2 * C * 4 = 8 * C bytes.
- By combining these two expressions, it follows that the total CPU cache footprint associated with producing one row of output is 24 * K + 8 * C bytes. Which, if we inject the value of K into the expression, translates into 32 * C + 48 * W bytes.
- For optimal performance, we would like all this data to fit in the L1 cache, so that input data is still accessible by the time we start processing the second row, knowing that we need some headroom for other things that go into the L1 cache, like constant data. So overall we would like to have 32 * C + 48 * W < S.
- And because we are actually interested in the maximal value of C, we rewrite this expression into the mathematically equivalent C < (S - 48 * W) / 32.
By injecting concrete values of S
and W
into this expression, we get a
maximal value of C
for which a given CPU can operate at optimal L1 cache
efficiency.
For example, if we consider an Intel i9-10900 CPU with 32 KiB of L1d cache and 256-bit AVX vectors, the limit of L1d cache capacity is reached when the simulation domain has around 976 columns (not accounting for other data which must fit in L1d cache). And unsurprisingly, this does indeed match the point where our microbenchmark’s performance drops.
So far so good, we have a theoretical model that is good enough to model the issue. Let us now use it to improve L1 cache hit rates at larger problem sizes!
The loop blocking trick
So far, we have used a simple iteration scheme where we fully iterate over each row of output concentrations, before moving to the next one:
It doesn’t have to be that way however. We could use a different scheme where we slice our data into a number of vertical blocks, and iterate over each vertical block before moving to the next one.
For example, we could first iterate over the left half of the data like this…
…and then iterate over the right half like this:
The computation is not sensitive to the order in which data points are computed, so this change of iteration order will not affect the results. What it will do, however, is to reduce the number of columns that are traversed before the computation moves to the next line, ideally ensuring that a large computation which did not fully leverage the CPU’s L1 cache before is now able to do so.
How wide should the blocks be? Well, there is a trade-off there. On one hand, CPUs work best when they are processing long, uninterrupted sequences of contiguous data, and from this perspective longer rows of data are better. On the other hand, the whole point of this chapter is to show you that long row lengths can be problematic for L1d cache locality. Taken together, these two statements entail that we should strive to make our blocks as wide as available L1d cache capacity will allow, but no wider.
Which is where the simple theoretical model that we derived previously comes into play: using it, we will be able to determine how wide blocks could ideally be, assuming that the L1d cache were fully available for simulation data. We will then shrink this estimate using an empirically tuned safety factor in order to leave some L1d cache headroom for other program state (like constants) and CPU memory subsystem black magic like prefetches. And this is how we will get our optimal block width.
Determining the block width
Recall that our simple theoretical model gives us an C < (S - 48 * W) / 32
bound on block width, where S
is the size of the L1d cache and W
is our SIMD
vector width. We already have seen how one can query the SIMD vector width. But
where are we supposed to find out how large the L1d cache is, given that it
depends on your CPU, and sometimes even on the core where you are executing?1
In a C program, the easiest way to query the cache layout of your CPU would be
to use the excellent hwloc
library, which abstracts over hardware and OS specifics and gives you an easy
and portable way to query CPU topology information. But since we are doing Rust,
using hwloc
directly will involve some unpleasant and unidiomatic unsafe
constructs.
Therefore, my recommendation will instead be for you to use
hwlocality
, a safe binding
that I wrote on top of hwloc
in order to make it easier to use from Rust.
We start by adding it to our project via a now familiar procedure…2
cargo add hwlocality
And then, inside of our code, we can easily use it to find the minimum L1d cache size across all available CPU cores:
use hwlocality::Topology;
let topology = Topology::new().expect("Failed to query hwloc topology");
let cache_stats = topology
.cpu_cache_stats()
.expect("Failed to probe CPU cache stats");
let l1d_size = cache_stats.smallest_data_cache_sizes()[0];
From this, we can deduce an upper bound on the optimal block width…
let l1d_size = smallest_l1d_cache_size();
let float_size = std::mem::size_of::<Float>();
let max_scalar_block_width = (l1d_size as usize - 12 * float_size * simd_width) / (8 * float_size);
…convert it to a number of SIMD vectors to account for our actual data layout…
let max_simd_block_width = max_scalar_block_width / simd_width;
…and shrink it down by a safety factor (that we will later tune through microbenchmarking) in order to account for other uses of the L1d cache that our simple theoretical model does not cover:
// FIXME: Adjust this safety factor through microbenchmarking
let ideal_simd_block_width = (max_simd_block_width as f32 * 0.8) as usize;
Putting it all together, we get this function:
/// Determine the optimal block size of the computation
pub fn simd_cols_per_block(simd_width: usize) -> usize {
let topology = Topology::new().expect("Failed to query hwloc topology");
let cache_stats = topology
.cpu_cache_stats()
.expect("Failed to probe CPU cache stats");
let l1d_size = cache_stats.smallest_data_cache_sizes()[0];
let float_size = std::mem::size_of::<Float>();
let max_scalar_block_width =
(l1d_size as usize - 12 * float_size * simd_width) / (8 * float_size);
let max_simd_block_width = max_scalar_block_width / simd_width;
// FIXME: Adjust this safety factor through microbenchmarking
(max_simd_block_width as f32 * 0.8) as usize
}
One thing to bear in mind here is that although the code may look innocent
enough, computing the block size involves some relatively expensive operations
like querying the CPU’s memory subsystem topology from the OS. So we should not
do it on every call to the update()
function. Instead, it should be computed
once during simulation initialization and kept around across update()
calls.
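As a sketch of what this could look like (the driver loop and field names below are assumed, not taken verbatim from the course code), the block width is computed once and then threaded through every update:
// Hypothetical simulation driver: query the block width once, outside of
// the hot loop, then reuse it for every update() call.
let cols_per_block = simd_cols_per_block(SIMD_WIDTH);
for _ in 0..opts.num_steps {
    update::<SIMD_WIDTH>(&opts, &start, &mut end, cols_per_block);
    // Swap the roles of the input and output tables before the next step
    std::mem::swap(&mut start, &mut end);
}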
Block-wise iteration
Now that we have a block size, let’s slice up our computation domain into actual blocks.
We start by adding a parameter to our update method so that the caller can pass in the precalculated chunk size.
pub fn update<const SIMD_WIDTH: usize>(
opts: &UpdateOptions,
start: &UV<SIMD_WIDTH>,
end: &mut UV<SIMD_WIDTH>,
cols_per_block: usize, // <- This is new
) where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
// TODO: Implementation
}
Then we extract the center of the output array and we slice it up into non-overlapping chunks:
use ndarray::Axis;
let center_shape = end.simd_shape().map(|dim| dim - 2);
let center = s![1..=center_shape[0], 1..=center_shape[1]];
let mut end_u_center = end.u.slice_mut(center);
let mut end_v_center = end.v.slice_mut(center);
let end_u = end_u_center.axis_chunks_iter_mut(Axis(1), cols_per_block);
let end_v = end_v_center.axis_chunks_iter_mut(Axis(1), cols_per_block);
So far, ndarray makes life easy for us. But unfortunately, it does not have an axis iterator that matches the semantics that we have for input windows, and therefore we are going to need to hack it using careful indexing.
We start by iterating over output blocks, using enumerate() and a counter of blocks to tell when we are going to reach the last block (which may be narrower than the other blocks)…
let num_blocks = center_shape[1].div_ceil(cols_per_block);
for (idx, (end_u, end_v)) in end_u.zip(end_v).enumerate() {
let is_last = idx == (num_blocks - 1);
// TODO: Process one output block here
}
…and then we slice up input blocks of the right size:
use ndarray::Slice;
let input_base = idx * cols_per_block;
let input_slice = if is_last {
Slice::from(input_base..)
} else {
Slice::from(input_base..input_base + cols_per_block + 2)
};
let start_u = start.u.slice_axis(Axis(1), input_slice);
let start_v = start.v.slice_axis(Axis(1), input_slice);
At this point, we have correctly sized start_u
, start_v
, end_u
and end_v
blocks. But our stencil_iter()
function cannot accept them yet because the
code has so far been specialized to take full UV
structs as input, and cannot
handle chunks of the simulation domain yet.
I will spare you the required code adjustments since the fine art of generalizing unsafe Rust code without compromising its safety is beyond the scope of this short course. But in the end we get this:
use ndarray::ArrayViewMut2;
#[inline]
pub fn stencil_iter<'data, const SIMD_WIDTH: usize>(
start_u: ArrayView2<'data, Vector<SIMD_WIDTH>>,
start_v: ArrayView2<'data, Vector<SIMD_WIDTH>>,
mut end_u_center: ArrayViewMut2<'data, Vector<SIMD_WIDTH>>,
mut end_v_center: ArrayViewMut2<'data, Vector<SIMD_WIDTH>>,
) -> impl Iterator<
Item = (
ArrayView2<'data, Vector<SIMD_WIDTH>>, // <- Input u window
ArrayView2<'data, Vector<SIMD_WIDTH>>, // <- Input v window
&'data mut Vector<SIMD_WIDTH>, // <- Output u
&'data mut Vector<SIMD_WIDTH>, // <- Output v
),
>
where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
// Assert that the sub-grids all have the expected memory layout.
assert_eq!(start_u.shape(), start_v.shape());
assert_eq!(end_u_center.shape(), end_v_center.shape());
assert_eq!(start_u.shape().len(), 2);
assert_eq!(end_u_center.shape().len(), 2);
assert!(start_u
.shape()
.iter()
.zip(end_u_center.shape())
.all(|(start_dim, end_dim)| *start_dim == end_dim + 2));
assert_eq!(start_u.strides(), start_v.strides());
assert_eq!(start_u.strides(), end_u_center.strides());
assert_eq!(start_u.strides(), end_v_center.strides());
// Collect and check common layout information
let in_shape = [start_u.shape()[0], start_u.shape()[1]];
assert!(in_shape.into_iter().min().unwrap() >= 2);
let strides = start_u.strides();
assert_eq!(strides.len(), 2);
assert!(strides.iter().all(|stride| *stride > 0));
let [row_stride, col_stride] = [strides[0] as usize, strides[1] as usize];
assert_eq!(col_stride, 1);
let [out_rows, out_cols] = in_shape.map(|dim| dim - 2);
// Determine how many elements we must skip in order to go from the
// past-the-end element of one row to the first element of the next row.
let next_row_step = row_stride - out_cols;
// Prepare a way to access input windows and output refs by output position
// The safety of the closures below is actually asserted on the caller's
// side, but sadly unsafe closures aren't a thing in Rust yet.
let stencil_shape = [STENCIL_WEIGHTS.len(), STENCIL_WEIGHTS[0].len()];
let window_shape = (stencil_shape[0], stencil_shape[1]).strides((row_stride, 1));
let unchecked_output = move |out_ptr: *mut Vector<SIMD_WIDTH>| unsafe { &mut *out_ptr };
let unchecked_input_window = move |in_ptr: *const Vector<SIMD_WIDTH>| unsafe {
ArrayView2::from_shape_ptr(window_shape, in_ptr)
};
// Recipe to emit the currently selected input windows and output references,
// then move to the next column. As before, this is only safe if called with
// correct element pointers.
let emit_and_increment =
move |in_u_ptr: &mut *const Vector<SIMD_WIDTH>,
in_v_ptr: &mut *const Vector<SIMD_WIDTH>,
out_u_ptr: &mut *mut Vector<SIMD_WIDTH>,
out_v_ptr: &mut *mut Vector<SIMD_WIDTH>| unsafe {
let win_u = unchecked_input_window(*in_u_ptr);
let win_v = unchecked_input_window(*in_v_ptr);
let out_u = unchecked_output(*out_u_ptr);
let out_v = unchecked_output(*out_v_ptr);
*in_u_ptr = in_u_ptr.add(1);
*in_v_ptr = in_v_ptr.add(1);
*out_u_ptr = out_u_ptr.add(1);
*out_v_ptr = out_v_ptr.add(1);
(win_u, win_v, out_u, out_v)
};
// Set up iteration state
let mut in_u_ptr = start_u.as_ptr();
let mut in_v_ptr = start_v.as_ptr();
let mut out_u_ptr = end_u_center.as_mut_ptr();
let mut out_v_ptr = end_v_center.as_mut_ptr();
//
// End of the current row processed by out_v_ptr
let mut out_v_row_end = unsafe { out_v_ptr.add(out_cols) };
//
// End of the last row of the output grid
let out_v_end = unsafe { out_v_row_end.add(out_rows.saturating_sub(1) * row_stride) };
// Emit output iterator
std::iter::from_fn(move || {
// Common case : we are within the bounds of a row and advance normally
if out_v_ptr < out_v_row_end {
return Some(emit_and_increment(
&mut in_u_ptr,
&mut in_v_ptr,
&mut out_u_ptr,
&mut out_v_ptr,
));
}
// Otherwise, check if we reached the end of iteration
if out_v_ptr == out_v_end {
return None;
}
// We're at the end of a row, but not at the end of iteration:
// switch to the next row then emit the next element as usual
debug_assert_eq!(out_v_ptr, out_v_row_end);
unsafe {
in_u_ptr = in_u_ptr.add(next_row_step);
in_v_ptr = in_v_ptr.add(next_row_step);
out_u_ptr = out_u_ptr.add(next_row_step);
out_v_ptr = out_v_ptr.add(next_row_step);
out_v_row_end = out_v_ptr.add(out_cols);
}
Some(emit_and_increment(
&mut in_u_ptr,
&mut in_v_ptr,
&mut out_u_ptr,
&mut out_v_ptr,
))
})
}
Basically, where we used to slice the center of the output ourselves, the caller is now responsible for doing it. The rest works as before, because slicing an N-d array does not affect the memory stride from one row to the next, which is the main low-level layout information that we rely on here.
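To make this concrete, here is a minimal standalone sketch (not from the course’s code) checking that property with ndarray: slicing away the borders of a 2-D array shrinks its shape, but leaves the row-to-row stride untouched.
use ndarray::{s, Array2};
fn main() {
    // 8x10 row-major array: rows are 10 elements apart in memory
    let full = Array2::<f32>::zeros((8, 10));
    // Slice away the 1-element border, as the caller of stencil_iter() does
    let center = full.slice(s![1..-1, 1..-1]);
    // The shape shrinks...
    assert_eq!(center.shape(), &[6, 8]);
    // ...but the row-to-row stride is still that of the full array
    assert_eq!(full.strides()[0], 10);
    assert_eq!(center.strides()[0], 10);
}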
Exercise
Integrate this loop blocking optimization into the code. Note that this will
require some changes to the run_simulation
function.
Then microbenchmark the code, and adjust the safety factor in the
implementation of simd_cols_per_block
to see how it affects
performance. The results may surprise you!3
As seen in power-efficiency-focused CPUs like Arm big.LITTLE and Intel Alder Lake.
Before it works, however, you will also need to ensure that your Linux distribution’s equivalent of Ubuntu and Debian’s libhwloc-dev package is installed. Unfortunately, C dependency management is not quite at the same level of convenience as Rust’s cargo add…
…and in fact, I am still suspicious of them myself, and would like to spend more time analyzing them later on. Something for a future edition of this school?
Parallelism
Well, that last chapter was a disappointment. All this refactoring work, only to find out in the final microbenchmark that the optimization we implemented does improve the L1d cache hit rate (as can be measured using perf stat -d), but this improvement in CPU utilization efficiency does not translate into an improvement in execution speed.
The reason is not clear to me at this point in time, unfortunately. It could be several things:
- The CPU manages to hide the latency of L1d cache misses by executing other pending instructions, so even at a 20% L1d cache miss rate we are still not memory-bound.
- The optimization is somehow not implemented properly and costs more than it helps. I have checked for the absence of simple issues here, but there could be more subtle ones around.
- There is another, currently unknown factor preventing the execution of more instructions per cycle, so even if data is more readily available in the L1d cache, we still cannot use it yet.
Hopefully I will find the time to clarify this before the next edition of this school. But for now, let us move to the last optimization that should be performed once we are convinced that we are using a single CPU core as efficiently as we can, namely using all of our other CPU cores.
Another layer of loop blocking
There is an old saying that almost every problem in programming can be resolved by adding another layer of indirection. However, it could just as well be argued that almost every problem in software performance optimization can be resolved by adding another layer of loop blocking.
In this particular case, the loop blocking that we are going to add revolves around slicing our simulation domain into independent chunks that can be processed by different CPU cores for parallelism. This raises the question: in which direction should all those chunks be cut? Across rows or columns? And how big should they be?
Let’s start with the first question. We are using ndarray
, which in its
default configuration stores data in row-major order. And we also know that CPUs
are very fond of iterating across data in long linear patterns, and will make
you pay a hefty price for any sort of jump across the underlying memory buffer.
Therefore, we should think twice before implementing any sort of chunking that
makes the rows that we are iterating over shorter, which means that chunking the
data into blocks of rows for parallelization is a better move.
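As a quick standalone illustration of this point (not part of the simulation code), chunking a row-major ndarray along Axis(0) yields blocks that are still contiguous in memory, whereas chunking along Axis(1) forces a jump between consecutive rows of each block:
use ndarray::{Array2, Axis};
fn main() {
    let a = Array2::<f32>::zeros((8, 10));
    // Blocks of rows: each block is still one contiguous slab of memory,
    // so iterating over it remains a long linear access pattern
    for rows in a.axis_chunks_iter(Axis(0), 4) {
        assert!(rows.is_standard_layout());
    }
    // Blocks of columns: each row of a block is separated from the next
    // one by a jump across the underlying memory buffer
    for cols in a.axis_chunks_iter(Axis(1), 5) {
        assert!(!cols.is_standard_layout());
    }
}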
As for how big the chunks should be, it is basically a balance between two factors:
- Exposing more opportunities for parallelism and load balancing by cutting smaller chunks. This pushes us towards cutting the problem into at least N chunks, where N is the number of CPU cores that our system has, and preferably more to allow for dynamic load balancing of tasks between CPU cores when some cores process work slower than others.1
- Amortizing the overhead of spawning and awaiting parallel work by cutting larger chunks. This pushes us towards cutting the problem into chunks no smaller than a certain size, dictated by processing speed and task spawning and joining overhead (sketched below).
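To make this tension concrete, here is a hypothetical helper that combines the two bounds. Neither the chunk_size function nor its 4x oversubscription factor comes from the course’s code, and as we will see shortly, the actual implementation only enforces the overhead bound by hand and lets rayon handle the rest.
// Hypothetical helper illustrating the two bounds above. Neither the
// name nor the 4x oversubscription factor comes from the course's code.
fn chunk_size(total_elems: usize, num_cores: usize, min_elems_per_task: usize) -> usize {
    // Parallelism & load balancing: aim for several chunks per core...
    let balanced = total_elems.div_ceil(num_cores * 4);
    // ...but never cut chunks smaller than the overhead-driven minimum
    balanced.max(min_elems_per_task)
}
fn main() {
    // Large problem: the load balancing bound dominates
    assert_eq!(chunk_size(131_072, 8, 100), 4_096);
    // Small problem: the overhead bound dominates
    assert_eq!(chunk_size(1_000, 8, 100), 100);
}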
The rayon
library can take care of the first concern for us by dynamically
splitting work as many times as necessary to achieve good load balancing on the
specific hardware and system that we’re dealing with at runtime. But as we have
seen before, it is not good at enforcing sequential processing cutoffs. Hence we
will be taking that matter into our own hands.
Configuring the minimal block size
In the last chapter, we used a hardcoded safety factor to pick the number of columns in each block, and you could hopefully see during the exercises that this made the safety factor unpleasant to tune. This chapter will therefore introduce you to the superior approach of making tuning parameters adjustable via CLI parameters and environment variables.
clap
makes this very easy. First we enable the environment variable feature…
cargo add --features=env clap
…we add the appropriate option to our UpdateOptions
struct…
#[derive(Debug, Args)]
pub struct UpdateOptions {
// ... unchanged existing options ...
/// Minimal number of data points processed by a parallel task
#[arg(env, long, default_value_t = 100)]
pub min_elems_per_parallel_task: usize,
}
That’s it. Now, if we either pass the --min-elems-per-parallel-task option to
the simulate binary or set the MIN_ELEMS_PER_PARALLEL_TASK environment
variable, the resulting value will be used as our sequential processing
granularity in the code that we are going to write next.
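For reference, here is a hedged sketch of how such an Args struct is typically wired into a clap-based binary. The SimulateOptions wrapper and the main function below are illustrative assumptions, not the course’s actual CLI definition.
use clap::{Args, Parser};
#[derive(Debug, Args)]
pub struct UpdateOptions {
    // ...other options omitted for brevity...
    /// Minimal number of data points processed by a parallel task
    #[arg(env, long, default_value_t = 100)]
    pub min_elems_per_parallel_task: usize,
}
/// Hypothetical top-level CLI struct, standing in for the binary's real one
#[derive(Debug, Parser)]
struct SimulateOptions {
    #[command(flatten)]
    update: UpdateOptions,
}
fn main() {
    // Both `--min-elems-per-parallel-task 500` on the command line and
    // `MIN_ELEMS_PER_PARALLEL_TASK=500` in the environment end up here
    let opts = SimulateOptions::parse();
    println!("{}", opts.update.min_elems_per_parallel_task);
}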
Adding parallelism
We then begin our parallelism journey by enabling the rayon
support of the
ndarray
crate. This enables some ndarray
producers to be
turned into Rayon parallel iterators.
cargo add --features=rayon ndarray
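As a minimal, self-contained illustration of what this feature unlocks (separate from the simulation code), an ndarray axis-chunk producer can now be turned into a Rayon parallel iterator:
use ndarray::{Array2, Axis};
use rayon::prelude::*;
fn main() {
    let mut a = Array2::<f32>::zeros((1024, 1024));
    // The sequential producer of mutable row blocks...
    a.axis_chunks_iter_mut(Axis(0), 128)
        // ...becomes a Rayon parallel iterator
        .into_par_iter()
        .for_each(|mut block| block.fill(1.0));
    assert_eq!(a[[0, 0]], 1.0);
}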
Next we split our update()
function into two:
- One top-level update() function that will be in charge of receiving user parameters, extracting the center of the output arrays, and parallelizing the work if deemed worthwhile.
- One inner update_seq() function that will do most of the work that we did before, but using array windows instead of manipulating the full concentration arrays directly.
Overall, it looks like this:
/// Parallel simulation update function
pub fn update<const SIMD_WIDTH: usize>(
opts: &UpdateOptions,
start: &UV<SIMD_WIDTH>,
end: &mut UV<SIMD_WIDTH>,
cols_per_block: usize,
) where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
// Extract the center of the output domain
let center_shape = end.simd_shape().map(|dim| dim - 2);
let center = s![1..=center_shape[0], 1..=center_shape[1]];
let mut end_u_center = end.u.slice_mut(center);
let mut end_v_center = end.v.slice_mut(center);
// Translate the element-based sequential iteration granularity into a
// row-based granularity.
let min_rows_per_task = opts
.min_elems_per_parallel_task
.div_ceil(end_u_center.ncols() * SIMD_WIDTH);
    // Run the simulation in parallel if the domain is large enough,
    // sequentially otherwise
if end_u_center.nrows() > min_rows_per_task {
// TODO: Run the simulation in parallel
} else {
// Run the simulation sequentially
update_seq(
opts,
start.u.view(),
start.v.view(),
end_u_center,
end_v_center,
cols_per_block,
);
}
}
/// Sequential update on a subset of the simulation domain
#[multiversion(targets("x86_64+avx2+fma", "x86_64+avx", "x86_64+sse2"))]
pub fn update_seq<const SIMD_WIDTH: usize>(
opts: &UpdateOptions,
start_u: ArrayView2<'_, Vector<SIMD_WIDTH>>,
start_v: ArrayView2<'_, Vector<SIMD_WIDTH>>,
mut end_u: ArrayViewMut2<'_, Vector<SIMD_WIDTH>>,
mut end_v: ArrayViewMut2<'_, Vector<SIMD_WIDTH>>,
cols_per_block: usize,
) where
LaneCount<SIMD_WIDTH>: SupportedLaneCount,
{
// Slice the output domain into vertical blocks for L1d cache locality
let num_blocks = end_u.ncols().div_ceil(cols_per_block);
let end_u = end_u.axis_chunks_iter_mut(Axis(1), cols_per_block);
let end_v = end_v.axis_chunks_iter_mut(Axis(1), cols_per_block);
// Iterate over output blocks
for (block_idx, (end_u, end_v)) in end_u.zip(end_v).enumerate() {
let is_last = block_idx == (num_blocks - 1);
// Slice up input blocks of the right width
let input_base = block_idx * cols_per_block;
let input_slice = if is_last {
Slice::from(input_base..)
} else {
Slice::from(input_base..input_base + cols_per_block + 2)
};
let start_u = start_u.slice_axis(Axis(1), input_slice);
let start_v = start_v.slice_axis(Axis(1), input_slice);
// Process current input and output blocks
for (win_u, win_v, out_u, out_v) in stencil_iter(start_u, start_v, end_u, end_v) {
// TODO: Same code as before
}
}
}
Once this is done, parallelizing the loop becomes a simple matter of implementing loop blocking as we did before, but across rows, and iterating over the blocks using Rayon parallel iterators instead of sequential iterators:
// Bring Rayon parallel iteration in scope
use rayon::prelude::*;
// Slice the output domain into horizontal blocks for parallelism
let end_u = end_u_center.axis_chunks_iter_mut(Axis(0), min_rows_per_task);
let end_v = end_v_center.axis_chunks_iter_mut(Axis(0), min_rows_per_task);
// Iterate over parallel tasks
let num_tasks = center_shape[0].div_ceil(min_rows_per_task);
end_u
.into_par_iter()
.zip(end_v)
.enumerate()
.for_each(|(task_idx, (end_u, end_v))| {
let is_last_task = task_idx == (num_tasks - 1);
// Slice up input blocks of the right height
let input_base = task_idx * min_rows_per_task;
let input_slice = if is_last_task {
Slice::from(input_base..)
} else {
Slice::from(input_base..input_base + min_rows_per_task + 2)
};
let start_u = start.u.slice_axis(Axis(0), input_slice);
let start_v = start.v.slice_axis(Axis(0), input_slice);
// Process the current block sequentially
update_seq(opts, start_u, start_v, end_u, end_v, cols_per_block);
});
Exercise
Integrate these changes into your codebase, then adjust the available tuning parameters for optimal runtime performance:
- First, adjust the number of threads that rayon uses via the RAYON_NUM_THREADS environment variable. If the machine that you are running on has hyper-threading enabled, it is almost always a bad idea to use it on performance-optimized code, so using half the number of system-reported CPUs will already provide a nice speedup. And since rayon is not NUMA-aware yet, using more threads than the number of cores in one NUMA domain (which you can query using lscpu) may not be worthwhile. You can double-check the effective thread count with the snippet after this list.
- Next, try to tune the MIN_ELEMS_PER_PARALLEL_TASK parameter. Runtime performance is not very sensitive to this parameter, so you will want to start with big adjustments by factors of 10x more or 10x less, then fine-tune with smaller adjustments once you find a region of the parameter space that seems optimal. Finally, adjust the defaults to your tuned value.
And with that, if we ignore the small wrinkle of cache blocking not yet working as we would expect it to (which indicates a bug in either our cache blocking implementation or our expectations of its impact), we have taken the performance of the Gray-Scott reaction computation as far as Rust will let us on CPU.
Load balancing becomes vital for performance as soon as your system has CPU cores of heterogeneous processing capabilities, like Arm’s big.LITTLE, Intel’s Alder Lake, and any CPU that has per-core turbo frequencies. But even on systems with homogeneous CPU core processing capabilities, load imbalance can dynamically occur as a result of e.g. interrupts from some specific hardware being exclusively processed by one specific CPU core. Therefore, designing your program to allow for some amount of load balancing is a good idea, as long as the associated task spawning and joining work does not cost you too much.
Next steps
This concludes our introduction to numerical computations in Rust. I hope you enjoyed it.
As you could see, the language is in a bit of an interesting spot as of 2024, when this course was written. Some aspects like iterators and multi-threading are much more advanced than in any other programming language, others like SIMD and N-d arrays are in a place that’s roughly comparable to the C++ state of the art, and then other things like GPU programming need more work, which in some cases is itself blocked on more fundamental ongoing language/compiler work (variadic generics, generic const expressions, specialization, etc.).
What the language needs most today, however, are contributors to the library ecosystem. So if you think that Rust is close to your needs, and would be usable for project X if and only if it had library Y, then stop and think. Do you have the time to contribute to the existing library Y draft, or to write one from scratch yourself? And would this be more or less worthy of your time, in the long run, than wasting hours on programming languages that have a more mature library ecosystem, but whose basic design is stuck in an evolutionary dead-end?
If you think the latter, then consider first perfecting your understanding of Rust with one of these fine reference books:
- The Rust Programming Language is maintained by core contributors of the project, often most up to date with respect to language evolutions, and freely available online. It is, however, written for an audience that is relatively new to programming, so its pace can feel a bit slow for experienced practitioners of other programming languages.
- For this more advanced audience, the “Programming Rust” book by Blandy et al. is a worthy purchase. It assumes that the audience already has some familiarity with C/C++, and exploits this to skip more quickly to the aspects of the language that this audience will find interesting.
More of a fan of learning by doing? There are resources for that as well, like the Rustlings interactive tutorial, which teaches you the language by making you fix programs that don’t compile, or Rust By Example, which gives you lots of ready-made snippets that you can take inspiration from in your early Rust projects.
And then, as you get more familiar with the language, you will be hungry for more reference documentation. Common docs to keep close by are the standard library docs, the Cargo book, the language reference, and the Rustonomicon for those cases that warrant the use of unsafe code.
You can find all of these links and more on the language’s official documentation page at https://www.rust-lang.org/learn.
And that will be it for this course. So long, and thanks for all the fish!