Efficient routines for repeated bit permutations.
Permutations of n bits can be performed in O(log(n)) operations which reshuffle the individual bits in parallel. For each reshuffling layer to be efficient, it has to be chosen from a set of operations which are cheap on the CPU. Precomputing these operations is slightly non-trivial, so this package may be useful only if you need to apply a given permutation to a large number of words.
BitPermutations is a Julia Language package. To install it, open Julia's interactive session (the REPL) and press the `]` key to enter package mode, then type the following command:

```julia
pkg> add BitPermutations
```
To define a permutation of the bits in type `T`, construct a `BitPermutation{T}`. Here is an example with `T = UInt8`. Bits are ordered from LSB to MSB:
```julia
using BitPermutations

v = [2, 6, 5, 8, 4, 7, 1, 3]
p = BitPermutation{UInt8}(v)
```
We can then apply the permutation to any bitstring of type `UInt8` using `bitpermute`. If a functional approach is your jam, you can equivalently call the `BitPermutation` directly:
```julia
x = 0x08            # 0 0 0 1 0 0 0 0 (LSB -> MSB)
bitpermute(x, p)    # 0 0 0 0 1 0 0 0, or 0x10
p(x)                # Idem
```
To inspect the result, we can use `bitstring`, or we can use a `Bits` (defined by the package). It is basically a faster `BitVector` (defined in `Base`), since its size is fixed (but it is mutable):
```julia
bits = Bits(x)
[bits[v], Bits(bitpermute(x, p))]
```
The neat thing about the underlying network is that the inverse permutation can be computed at the same cost.
This can be performed using `invbitpermute` or by calling the adjoint permutation:
```julia
invbitpermute(x, p) === p'(x)
[bits[invperm(v)], Bits(invbitpermute(x, p))]
```
Internally, a `Vector` of bit masks and shifts is stored and then applied sequentially at each call to `bitpermute`.
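To give an idea of what this looks like, here is a minimal sketch assuming a simple (mask, shift) representation; `ToyNetwork` and `apply` are made-up names for illustration, not the package's actual internals:

```julia
# A toy "network": parallel vectors of masks and shift amounts.
struct ToyNetwork{T<:Unsigned}
    masks::Vector{T}
    shifts::Vector{Int}
end

function apply(net::ToyNetwork{T}, x::T) where {T<:Unsigned}
    for (m, δ) in zip(net.masks, net.shifts)
        # Delta-swap: exchange bit i and bit i + δ wherever mask bit i is set.
        t = ((x >> δ) ⊻ x) & m
        x = x ⊻ t ⊻ (t << δ)
    end
    return x
end
```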
If you need to apply the permutation to an array of data, you can use the following syntax:

```julia
xs = rand(T, 10)
p.(xs)                  # Regular permutation element-wise (also `bitpermute.(xs, p)`)
bitpermute(xs, p)       # Idem
bitpermute!(xs, p)      # Idem, but in-place
p'.(xs)                 # Inverse permutation element-wise (also `invbitpermute.(xs, p)`)
invbitpermute(xs, p)    # Idem
invbitpermute!(xs, p)   # Idem, but in-place
```
Internally, the broadcasted loop is reorganized so that each stage of the network is applied to the whole array before moving on to the next. This leads to a significant performance increase. As usual, if you do not need the original vector after the permutation step, the in-place version is faster, as it saves some allocations.
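In sketch form (hypothetical code again, reusing the `ToyNetwork` sketch from above), the trick is to put the loop over stages on the outside, so the inner loop repeats the same cheap operation on every element:

```julia
function apply!(net::ToyNetwork{T}, xs::AbstractVector{T}) where {T<:Unsigned}
    for (m, δ) in zip(net.masks, net.shifts)    # outer loop over stages
        @inbounds @simd for i in eachindex(xs)  # inner loop over the data
            x = xs[i]
            t = ((x >> δ) ⊻ x) & m
            xs[i] = x ⊻ t ⊻ (t << δ)
        end
    end
    return xs
end
```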
The functionalities mentioned above are best summarized in the plot below. The results were obtained on an Intel Ice Lake processor running Julia 1.9.2 with `--threads=1 --optimize=3 --check-bounds=no`.
To accurately measure single permutations, we repeatedly perform the same permutation `N` different times, similarly to:
```julia
x = rand(T)
for _ in 1:N
    x = bitpermute(x, p)
end
```
while the broadcasted permutation is performed on an array in an element-wise fashion:

```julia
xs = rand(T, N)
bitpermute!(xs, p)
```
In both cases we choose `N = 10_000` and divide the median execution time by `N`.
Considering single permutations, Beneš networks are consistently faster than naïve permutations of `BitVector`s, by roughly a factor of 10.
As discussed later on, for the types `UInt32` and `UInt64`, choosing a `GRPNetwork` can lead to significant speedups on processors which support BMI2 instructions.
Indeed, this leads to a speedup of around 20 compared to `BitVector`s, since it exploits advanced processor intrinsics at each stage. For other types, however, the fallback implementation of these primitive operations is rather slow and should be avoided.
If you're lucky enough to have AVX-512 intrinsics available on your processor, then `AVXCopyGather` is the method of choice for `UInt16`, `UInt32`, and `UInt64`, since it can perform arbitrary permutations in a few nanoseconds.
This benchmark was performed for random permutations, which is somewhat of a worst-case scenario, since each of these networks then typically has the largest possible number of stages. If your permutations are not completely random but have some structure, you might achieve even larger speedups.
The most dramatic speedups are, however, for array-wise permutations.
In this case a `BenesNetwork` can permute bitstrings in a couple of nanoseconds, yielding speedups of more than two orders of magnitude.
This is because the order of the operations is rearranged: instead of applying all the layers of the network to each bitstring in sequence, each single layer is applied globally to every element of the array.
This allows the processor to do the same operation over and over again, and potentially even use AVX instructions (although the same speedups are observed on Apple silicon).
I am unsure why the `GRPNetwork` is not faster than the `BenesNetwork` for `UInt32` and `UInt64`: I believe the reason is that the compiler uses AVX instructions, but the PDEP/PEXT intrinsics are not vectorized.
The `AVXCopyGather` algorithm does not improve as drastically with broadcasting, since the single permutation already exploits vectorized operations, but it remains competitive with the other algorithms.
For `UInt128`, the bitstrings exceed the processor word size, and everything is a bit slower. The speedup will also depend on the choice of `N`.
Your mileage may vary, especially depending on whether or not the array fits in cache.
The full benchmark routine can be found in `benchmark/benchmarks.jl`, while the script for plotting the results is `benchmark/plot.jl`.
For a more in-depth introduction to performing bit permutations, the wonderful https://programming.sirrida.de/bit_perm.html is well worth reading.
Three different backends to perform the permutation are implemented: rearranged Beneš networks, GRP networks, and AVX-512 bit shuffles.
The latter two are faster only because they exploit CPU intrinsics, special operations that are available on certain x86-64 processors, in particular from the AVX-512 and BMI2 instruction sets.
If the AVX-512 instruction set is detected, an `AVXCopyGather{T}` is used by default if `T` is 16, 32, or 64 bits long. Otherwise, if the BMI2 instruction set is detected and `T` is 32 or 64 bits long, a `GRPNetwork{T}` is chosen. For all other cases, the default choice is a `BenesNetwork{T}`.
The backend can be chosen manually when constructing the permutation, for example:

```julia
p = BitPermutation{UInt32}(GRPNetwork, v)
```
Using intrinsics can be disabled by setting `ENV["BIT_PERMUTATIONS_USE_INTRINSICS"] = false` before loading the package, or by setting

```sh
export BIT_PERMUTATIONS_USE_INTRINSICS=false
```

before launching Julia.
A Beneš network is a series of butterfly or delta-swap operations, in which each node can be swapped or not with the corresponding node shifted by a fixed amount δ. These operations are arranged in pairs, in a nested fashion, with the shifts chosen to be powers of two.
Above is an example of a network with 8 nodes and 3 different stages (pairs of reshufflings), with δ equal to 4, 2, and 1 respectively.
The two innermost operations can always be fused into a single one.
Somewhat remarkably, in this way one can perform any arbitrary permutation.
It is also relatively efficient, as each delta-swap should take around 6 cycles on modern processors.
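As a concrete instance of the delta-swap formula (purely illustrative; the mask selects the lower bit of each swapped pair), here is how the two nibbles of a byte are exchanged:

```julia
x = 0xab                  # 1 1 0 1 0 1 0 1 (LSB -> MSB)
δ, mask = 4, 0x0f         # swap bit i with bit i + δ for i = 1, ..., 4
t = ((x >> δ) ⊻ x) & mask
y = x ⊻ t ⊻ (t << δ)      # y == 0xba: the two nibbles are exchanged
```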
The construction of the masks for the swaps is explained in: Donald E. Knuth, The Art of Computer Programming, Vol. 4, Fascicle 1A, (Addison-Wesley, Stanford, 2008), available as pre-fascicle here.
The general idea is to view each stage (pair of masks) as a repartition of the nodes into two sets.
We can then construct a graph in which each edge corresponds to the constraint of having two nodes in different partitions, both on the input and output side.
If we can 2-color the graph, we can then route each node to its corresponding partition, or color.
Fortunately, the graph is bipartite and it is very easy to do so: we iterate through the cycles of the graph and assign alternating colors to the nodes we visit.
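That coloring step looks roughly like the following (a generic bipartite 2-coloring sketch with a made-up `twocolor` function; the actual routing in the package involves more bookkeeping):

```julia
# 2-color a bipartite graph given as adjacency lists: walk each
# connected component (here, each cycle) and alternate colors.
function twocolor(adj::Vector{Vector{Int}})
    color = fill(-1, length(adj))
    for s in eachindex(adj)
        color[s] == -1 || continue
        color[s] = 0
        stack = [s]
        while !isempty(stack)
            u = pop!(stack)
            for w in adj[u]
                if color[w] == -1
                    color[w] = 1 - color[u]
                    push!(stack, w)
                end
            end
        end
    end
    return color
end
```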
One would wish to optimize the network in such a way that most masks are trivial (i.e. no swaps). Unfortunately, I do not know of any other way than exhaustively checking all possible log(n)! arrangements.
This can be disabled by passing the keyword argument `rearrange=false` to the constructor.
GRP networks work in a similar way to Beneš networks, except that each layer is a different reshuffling, known as GRP or sheep-and-goats (see also TAOCP).
GRP networks are typically shallower, but the reshuffling operation is only efficient if specific instructions are available in hardware, in which case it can be performed in around 8 cycles.
The PEXT/PDEP instructions used for the GRP reshuffling are supported by Intel starting from the Haswell processors (released in 2013) and by AMD from the Zen 3 generation (released in 2020). On older AMD architectures, PEXT/PDEP are implemented in microcode and are reportedly slower. On such machines you may want to experiment with which method is faster, and possibly disable calls to BMI2 with `ENV["BIT_PERMUTATIONS_USE_INTRINSICS"] = false`.
Fallback operations are implemented, but they are typically much slower than butterfly operations.
The construction of the masks follows the algorithm in: R. Lee, Z. Shi, X. Yang, Efficient permutation instructions for fast software cryptography, IEEE Micro (2001); which is well explained here.
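To make the primitive concrete, here is a hedged software sketch of a GRP-style operation built on a scalar `pext` fallback (`pext_sw` and `grp` are illustrative names, and one possible gather convention is shown; on BMI2 hardware the package maps these to the PEXT/PDEP instructions instead):

```julia
# Software pext: gather the bits of x selected by mask m into the
# low end of the result, preserving their order.
function pext_sw(x::T, m::T) where {T<:Unsigned}
    r, k = zero(T), 0
    while m != zero(T)
        if x & m & -m != zero(T)  # test x at the lowest set bit of m
            r |= one(T) << k
        end
        k += 1
        m &= m - one(T)           # clear the lowest set bit of m
    end
    return r
end

# GRP ("sheep-and-goats"): bits selected by the mask are gathered at
# the low end, and the remaining bits are stacked above them.
grp(x::T, m::T) where {T<:Unsigned} =
    pext_sw(x, m) | (pext_sw(x, ~m) << count_ones(m))
```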
As mentioned in this blog post, new x86-64 processors have very powerful built-in instructions that can perform arbitrary bit shuffles.
These types of operations exploit SIMD-vectorized operations on 128-, 256-, or 512-bit registers.
By duplicating the data to permute, we can then extract the bits we need using a mask.
Despite the copying overhead, this turns out to be the fastest option for `UInt16`, `UInt32`, and `UInt64`.
Unfortunately, such operations are consistently available only from Intel's Ice Lake generation onwards (starting from 2019).
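In scalar form, the gathering step can be modeled as follows (a toy model with a hypothetical `gatherbits` function; the real kernel does this on copies of the data held in wide registers, using the AVX-512 bit-shuffle instruction):

```julia
# Model of a bit gather: output bit i takes input bit sel[i]
# (1-indexed from the LSB).
function gatherbits(x::UInt64, sel::Vector{Int})
    r = zero(UInt64)
    for i in eachindex(sel)
        r |= ((x >> (sel[i] - 1)) & one(UInt64)) << (i - 1)
    end
    return r
end
```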
Several improvements could be made. Here I just list the first ones off the top of my head:
- **Use ARM64-specific instructions.** It would be nice to improve performance on ARM64 processors by exploiting intrinsics for these processors, specifically Apple silicon. In SVE there is some kind of `PEXT`/`PDEP` equivalent here, as well as GRP operations, but unfortunately they don't seem to be supported by Apple. There are also some interesting possibilities described here.
- **Preconditioning.** A simple way of reducing the depth of the network is to try cheap operations like `bitreverse` and `bitshift` before or after the network to try to save masks. This is what is done here.
- **Lookup tables.** Small permutations could benefit from doing sub-permutations with a precomputed table. One could use `pext` to extract, say, 8 bits at a time, look up their permutation in a table of 256 elements, and join the results with `|`. This approach is fast but scales linearly with size, both in time and space, so it is interesting for permutations on `UInt8`s, `UInt16`s, and possibly `UInt32`s; see the sketch after this list.
- **PSHUFB.** The PSHUFB instruction is part of SSSE3 and can perform arbitrary byte reshufflings. It could be used to compress a bunch of layers, or combined with lookup tables for some very promising speedups.
- **Rearrangement.** Finding the best possible arrangement of the stages of a Beneš network (such that the most masks are trivial) is currently done by exhaustive search. It is likely there is a better way of constructing the masks, but I am not aware of any.
- **Better fallbacks.** Faster software fallbacks for `pext`/`pdep` exist, like zp7 (which uses x86 intrinsics anyway).
- **Code refactoring.** It should be possible to take a more functional approach and define a permutation as a series of transformations `T ↦ T`, but I'm not sure how to do that while preserving type inference and performance. This would allow for more generic algorithms and extensions.
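As a toy illustration of the lookup-table idea mentioned above (hypothetical code, not part of the package; a simplified variant that splits on plain bytes rather than `pext`-extracted groups), here is a 16-bit permutation done with two byte-indexed tables, following the same convention as the examples above (output bit i takes input bit v[i]):

```julia
# Precompute, for every byte value, where its bits land in the output.
function make_tables(v::Vector{Int})
    lo = zeros(UInt16, 256)
    hi = zeros(UInt16, 256)
    for b in 0:255, i in 1:16
        s = v[i]  # source bit of output bit i, 1-indexed from the LSB
        if s <= 8 && (b >> (s - 1)) & 1 == 1
            lo[b + 1] |= UInt16(1) << (i - 1)
        elseif s > 8 && (b >> (s - 9)) & 1 == 1
            hi[b + 1] |= UInt16(1) << (i - 1)
        end
    end
    return lo, hi
end

# Permute with two table lookups and a bitwise or.
permute16(x::UInt16, lo, hi) = lo[(x & 0xff) + 1] | hi[(x >> 8) + 1]
```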
This package is compatible with Julia 1.6 and above.