STREAMBenchmark

Getting a realistic estimate of the achievable (maximal) memory bandwidth

Note: This package implements a simple variant of the original STREAM benchmark. There is also BandwidthBenchmark.jl, which is a variant of TheBandwidthBenchmark.

`memory_bandwidth()`

The function memory_bandwidth() estimates the memory bandwidth in megabytes per second (MB/s). It returns a named tuple with the median, minimum, and maximum of the four kernel measurements (COPY, SCALE, ADD, and TRIAD).

A few important remarks upfront:

To obtain a reasonable estimate, you should start Julia with enough threads (e.g. as many as you have physical cores).
You should experiment with the length of the vectors used in the streaming kernels via the keyword argument N. Make it large enough (e.g. the number of NUMA nodes times four times the size of the outermost cache), in particular if you get unreasonably high bandwidths.
If possible, you should pin the Julia threads to separate cores. The simplest ways to pin N Julia threads to the first N cores (compact pinning) are 1) setting JULIA_EXCLUSIVE=1 or 2) using ThreadPinning.jl's pinthreads(:compact). We will use the latter below.

julia> using ThreadPinning

julia> pinthreads(:compact)

julia> using STREAMBenchmark

julia> memory_bandwidth(verbose=true)
╔══╡ Multi-threaded:
╠══╡ (10 threads)
╟─ COPY:  100205.2 MB/s
╟─ SCALE: 100218.7 MB/s
╟─ ADD:   100364.7 MB/s
╟─ TRIAD: 100293.1 MB/s
╟─────────────────────
║ Median: 100255.9 MB/s
╚═════════════════════
(median = 100255.9, minimum = 100205.2, maximum = 100364.7)
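
The returned named tuple can be reproduced from the four kernel results printed above. The following is a minimal sketch that assumes only the Statistics standard library; the variable names are illustrative and not part of the package.

```julia
using Statistics: median

# Kernel results (MB/s) copied from the verbose output above.
kernels = (copy = 100205.2, scale = 100218.7, add = 100364.7, triad = 100293.1)
rates = collect(values(kernels))

# With four values, the median is the mean of the two middle ones.
stats = (median = median(rates), minimum = minimum(rates), maximum = maximum(rates))
println(stats)
```

Because there is an even number of kernels, the reported median does not coincide with any single kernel result.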

Keyword arguments

N (default: STREAMBenchmark.default_vector_length()): length of the vectors used in the streaming kernels.
nthreads (default: Threads.nthreads()): number of threads to use for the benchmark. It must satisfy 1 ≤ nthreads ≤ Threads.nthreads().
write_allocate (default: true): assume that write-allocates occur and count the associated memory traffic.
verbose (default: false): verbose output, including the individual results of the streaming kernels.
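
The effect of write_allocate can be illustrated with the usual STREAM traffic accounting: COPY and SCALE read one vector and write one, ADD and TRIAD read two and write one, and a write-allocate adds one extra read of the destination. The sketch below assumes Float64 elements by default. The names words_per_element and bytes_moved are hypothetical helpers, not part of STREAMBenchmark.jl, and the package's internal accounting may differ in detail.

```julia
# Hypothetical helpers illustrating the traffic accounting; not the package API.

# Number of vector elements loaded or stored per loop iteration.
function words_per_element(kernel::Symbol; write_allocate::Bool = true)
    if kernel in (:copy, :scale)
        base = 2        # one read, one write
    elseif kernel in (:add, :triad)
        base = 3        # two reads, one write
    else
        throw(ArgumentError("unknown kernel: $kernel"))
    end
    # A write-allocate loads the destination cache line before it is overwritten.
    return base + (write_allocate ? 1 : 0)
end

# Total bytes moved by one pass of `kernel` over vectors of length N.
function bytes_moved(kernel::Symbol, N::Integer; write_allocate::Bool = true,
                     T::Type = Float64)
    return words_per_element(kernel; write_allocate = write_allocate) * sizeof(T) * N
end
```

Under this accounting, the same measured runtime translates into a rate that is higher by a factor of 3/2 for COPY and SCALE and by 4/3 for ADD and TRIAD when write_allocate=true.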

`benchmark()`

To run both the single- and multi-threaded benchmarks at once, call benchmark(), which produces output like this:

julia> benchmark()
╔══╡ Single-threaded:
╟─ COPY:  18880.8 MB/s
╟─ SCALE: 18537.2 MB/s
╟─ ADD:   17380.2 MB/s
╟─ TRIAD: 17359.9 MB/s
╟─────────────────────
║ Median: 17958.7 MB/s
╚═════════════════════

╔══╡ Multi-threaded:
╠══╡ (10 threads)
╟─ COPY:  100358.1 MB/s
╟─ SCALE: 100218.2 MB/s
╟─ ADD:   99508.0 MB/s
╟─ TRIAD: 99582.4 MB/s
╟─────────────────────
║ Median: 99900.3 MB/s
╚═════════════════════

(single = (median = 17958.7, minimum = 17359.9, maximum = 18880.8), multi = (median = 99900.3, minimum = 99508.0, maximum = 100358.1))

Scaling

Number of threads

To assess how the maximal memory bandwidth scales with the number of threads, we provide the function scaling_benchmark():

julia> y = scaling_benchmark()
# Threads: 1	Max. memory bandwidth: 19058.7
# Threads: 2	Max. memory bandwidth: 37511.2
# Threads: 3	Max. memory bandwidth: 55204.6
# Threads: 4	Max. memory bandwidth: 68706.6
# Threads: 5	Max. memory bandwidth: 76869.9
# Threads: 6	Max. memory bandwidth: 83669.9
# Threads: 7	Max. memory bandwidth: 88656.0
# Threads: 8	Max. memory bandwidth: 93701.0
# Threads: 9	Max. memory bandwidth: 97093.6
# Threads: 10	Max. memory bandwidth: 101293.9
10-element Vector{Float64}:
  19058.7
  37511.2
  55204.6
  68706.6
  76869.9
  83669.9
  88656.0
  93701.0
  97093.6
 101293.9
 
julia> using UnicodePlots

julia> lineplot(1:length(y), y, title = "Bandwidth Scaling", xlabel = "# cores", ylabel = "MB/s", border = :ascii, canvas = AsciiCanvas)

                            Bandwidth Scaling
               +----------------------------------------+
        110000 |                                        |
               |                                   __r-*|
               |                            __--"""     |
               |                      __-*""            |
               |                 ._-*"                  |
               |              .r*"                      |
               |           .r"`                         |
   MB/s        |         .*'                            |
               |       ./`                              |
               |      .'                                |
               |    ./                                  |
               |  .r`                                   |
               | ./                                     |
               |*`                                      |
         10000 |                                        |
               +----------------------------------------+
                1                                     10
                                 # cores
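
To quantify the scaling shown in the plot, the bandwidths can be converted into a speedup relative to one thread and a parallel efficiency. The following sketch uses the numbers from the output above; the variable names are illustrative.

```julia
# Bandwidths (MB/s) copied from the scaling_benchmark() output above.
y = [19058.7, 37511.2, 55204.6, 68706.6, 76869.9,
     83669.9, 88656.0, 93701.0, 97093.6, 101293.9]

threads = 1:length(y)
speedup = y ./ first(y)           # bandwidth relative to a single thread
efficiency = speedup ./ threads   # 1.0 would be perfect linear scaling

for (n, s, e) in zip(threads, speedup, efficiency)
    println("threads = ", n, "  speedup = ", round(s; digits = 2),
            "  efficiency = ", round(e; digits = 2))
end
```

For these numbers the speedup at 10 threads is roughly 5.3, which is consistent with the memory bandwidth saturating before all cores are in use.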

Vector length

By default, a vector length of four times the size of the outermost cache is used (a rule of thumb "laid down by Dr. Bandwidth"). To measure the memory bandwidth for a few other factors as well, you can use STREAMBenchmark.vector_length_dependence():

julia> STREAMBenchmark.vector_length_dependence()
1: 3604480 => 121692.2
2: 7208960 => 99755.5
3: 10813440 => 98705.5
4: 14417920 => 98660.5
Dict{Int64, Float64} with 4 entries:
  10813440 => 98705.5
  7208960  => 99755.5
  3604480  => 1.21692e5
  14417920 => 98660.5
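
The lengths above are consistent with a simple rule: the number of elements that fit in the outermost cache, multiplied by an integer factor. The sketch below reproduces them under the assumption of a 27.5 MiB outermost cache and Float64 elements (both inferred from the output, not queried from the hardware). The function vector_length is a hypothetical helper, and STREAMBenchmark.default_vector_length() may be implemented differently.

```julia
# Hypothetical helper; not the package's implementation.
function vector_length(cache_bytes::Integer; factor::Integer = 4,
                       numa_nodes::Integer = 1, T::Type = Float64)
    # Elements of type T that fit in the cache, scaled by the chosen factor
    # (and by the number of NUMA nodes, following the remark on choosing N).
    return numa_nodes * factor * div(cache_bytes, sizeof(T))
end

cache_bytes = 28_835_840   # 27.5 MiB, assumed size of the outermost cache
lengths = [vector_length(cache_bytes; factor = f) for f in 1:4]
println(lengths)
```

The noticeably higher reading for factor 1 is the kind of cache effect that the larger factors are meant to avoid.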

Comparison with original STREAM benchmark

We can download and compile the C source code of the original STREAM benchmark via STREAMBenchmark.jl:

julia> using STREAMBenchmark

julia> STREAMBenchmark.download_original_STREAM()
- Creating folder "stream"
- Downloading C STREAM benchmark
- Done.

julia> STREAMBenchmark.compile_original_STREAM(compiler=:gcc, multithreading=false)
- Trying to compile "stream.c" using gcc
  Using options: -O3 -DSTREAM_ARRAY_SIZE=14417920
- Done.

julia> STREAMBenchmark.execute_original_STREAM()
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 14417920 (elements), Offset = 0 (elements)
Memory per array = 110.0 MiB (= 0.1 GiB).
Total memory required = 330.0 MiB (= 0.3 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 11047 microseconds.
   (= 11047 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           11039.8     0.020987     0.020896     0.021092
Scale:          12491.1     0.018509     0.018468     0.018537
Add:            13370.0     0.025934     0.025881     0.026183
Triad:          13396.9     0.025903     0.025829     0.026223
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

julia> memory_bandwidth(verbose=true, nthreads=1, write_allocate=false) # the original benchmark does not count write-allocates (it assumes they are absent)
╔══╡ Single-threaded:
╠══╡ (1 threads)
╟─ COPY:  12749.1 MB/s
╟─ SCALE: 12468.2 MB/s
╟─ ADD:   13095.3 MB/s
╟─ TRIAD: 13131.2 MB/s
╟─────────────────────
║ Median: 12922.2 MB/s
╚═════════════════════
(median = 12922.2, minimum = 12468.2, maximum = 13131.2)
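
For orientation, the kind of measurement behind these numbers can be sketched for the TRIAD kernel. This is a simplified illustration and not the package's code: triad_bandwidth is a hypothetical function, it takes the best of a few trials, and it ignores details such as thread pinning and NUMA-aware initialization.

```julia
using Base.Threads: @threads

# Hypothetical sketch of a TRIAD measurement; not part of STREAMBenchmark.jl.
function triad_bandwidth(N::Integer; trials::Integer = 5, write_allocate::Bool = false)
    a = zeros(Float64, N)
    b = fill(2.0, N)
    c = fill(1.0, N)
    s = 3.0
    tmin = Inf
    for _ in 1:trials
        t = @elapsed begin
            @threads for i in eachindex(a, b, c)
                @inbounds a[i] = b[i] + s * c[i]
            end
        end
        tmin = min(tmin, t)   # keep the best (shortest) time
    end
    # Two reads and one write per element, plus one extra read if
    # write-allocates are counted.
    words = write_allocate ? 4 : 3
    bytes = words * sizeof(Float64) * N
    return bytes / tmin / 1e6   # MB/s
end

println(triad_bandwidth(10_000_000), " MB/s")
```

The result depends strongly on N, the number of Julia threads, and thread placement, so it should only be read as a rough cross-check.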

Further Options and Comments

LoopVectorization

You can make STREAMBenchmark.jl use LoopVectorization's @avxt instead of @threads by setting STREAMBenchmark.avxt() = true. Note, however, that this only works if nthreads=1 (a single thread is used) or nthreads=Threads.nthreads() (all threads are used). This is because @avxt isn't compatible with the way we restrict the benchmark to a subset of the available Julia threads.

Thread pinning

It is recommended to either set the environment variable JULIA_EXCLUSIVE=1 or use pinthreads(:compact) from ThreadPinning.jl to pin the Julia threads in use to the first nthreads cores.

See https://discourse.julialang.org/t/thread-affinitization-pinning-julia-threads-to-cores/58069 for a discussion of other options like numactl (with caveats).
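
Before benchmarking, it can help to check how the session was started. The sketch below uses only Base functionality; thread_setup is a hypothetical helper and not part of STREAMBenchmark.jl or ThreadPinning.jl. Note that Sys.CPU_THREADS counts logical CPUs, which may exceed the number of physical cores.

```julia
# Hypothetical helper that summarizes the thread-related setup of the session.
function thread_setup()
    return (
        julia_threads = Threads.nthreads(),
        cpu_threads = Sys.CPU_THREADS,   # logical CPUs, not physical cores
        julia_exclusive = get(ENV, "JULIA_EXCLUSIVE", "unset"),
    )
end

setup = thread_setup()
println(setup)

if setup.julia_threads > setup.cpu_threads
    @warn "More Julia threads than logical CPUs; threads cannot all run on separate cores."
end
```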

Resources

Original STREAM benchmark (C/Fortran): https://www.cs.virginia.edu/stream/
Blog post about how to optimize and interpret the benchmark: https://blogs.fau.de/hager/archives/8263

Acknowledgements

CI infrastructure is provided by the Paderborn Center for Parallel Computing (PC²).