Implementation of flexible (and thread-safe) temporary arrays and array pools for situations where a little semi-manual memory management improves performance. The package is used heavily throughout the ACE codebase and can lead to significant performance gains in some cases. Unfortunately, at this point, those gains are not always as systematic as one would hope.
The following table shows a basic benchmark for evaluating a Chebyshev basis on multiple inputs at the same time. This is a typical use case for this package: the cost of the arithmetic is of the same order of magnitude as the cost of allocation.
nB / nX | 10 / 16 | 10 / 32 | 30 / 16 | 30 / 32 |
---|---|---|---|---|
Array | 147 / 259 | 163 / 566 | 377 / 876 | 412 / 1286 |
pre-allocated | 89 / 97 | 65 / 66 | 253 / 263 | 213 / 214 |
FlexArray | 95 / 100 | 63 / 63 | 264 / 273 | 207 / 213 |
ArrayPool(FlexArray) | 91 / 93 | 68 / 70 | 264 / 271 | 216 / 223 |
FlexArrayCache | 104 / 106 | 88 / 94 | 280 / 287 | 270 / 283 |
ArrayPool(FlexArrayCache) | 111 / 112 | 93 / 98 | 285 / 292 | 275 / 287 |
TSafe(FlexArray) | 87 / 89 | 67 / 68 | 262 / 269 | 212 / 219 |
TSafe(ArrayPool(FlexArray)) | 96 / 97 | 74 / 77 | 262 / 271 | 224 / 232 |
`ObjectPools.jl` exports `FlexArray`, which can be used to keep memory for an array and adapt its type and size as needed. In particular, the eltype and size can change at runtime without performance loss. A `FlexArray` is constructed as follows:

```julia
tmp = FlexArray()
```

This stores a resizable array that can be obtained via

```julia
A = acquire!(tmp, (N,), Float64)   # N = length of array
A = acquire!(tmp, (10, 10, 10), Bool)
```

The object `tmp` actually stores a `Vector{UInt8}`, which is converted into a `PtrArray` and then reinterpreted and reshaped at essentially zero cost.
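
The idea of backing differently typed arrays by one byte vector can be sketched with Base Julia alone. This is only an illustration under stated assumptions, not the package's implementation: `ByteBuffer` and `view_as!` are hypothetical names, and Base's `reinterpret` and `reshape` stand in for the `PtrArray` conversion described above.

```julia
# Hypothetical stand-in for a flexible temporary array (Base only).
struct ByteBuffer
    bytes::Vector{UInt8}
end
ByteBuffer() = ByteBuffer(UInt8[])

# Return an array of element type T and size sz backed by buf.bytes.
# The byte vector only grows; it is never shrunk.
function view_as!(buf::ByteBuffer, sz::NTuple{N, Int}, ::Type{T}) where {N, T}
    nbytes = prod(sz) * sizeof(T)
    length(buf.bytes) < nbytes && resize!(buf.bytes, nbytes)
    flat = reinterpret(T, view(buf.bytes, 1:nbytes))   # no copy
    return reshape(flat, sz)                           # no copy
end

buf = ByteBuffer()
A = view_as!(buf, (3, 4), Float64)   # needs 96 bytes
fill!(A, 1.0)
B = view_as!(buf, (2, 2), Int32)     # needs 16 bytes, reuses the same storage
```

In this sketch `A` and `B` alias the same bytes, so an earlier view should be treated as invalid once the buffer is requested again.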
`ObjectPools.jl` exports `FlexArrayCache`, which provides stacks of arrays that can be reused without garbage collection. This can be thought of as a very limited, manual re-implementation of garbage collection. A cache is used as follows:

```julia
cache = FlexArrayCache()
A = acquire!(cache, (N,), Float64)
# do something with A
release!(A)
```

The `acquire!` function obtains an array of size `(N,)` from the stack (in the current thread). Once the array is no longer needed, it can be returned to the stack via `release!`. It is fine if it is never released: once there is no longer a reference to `A`, it will simply be garbage collected.
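
The acquire/release pattern can be illustrated with a plain stack of byte vectors. This is a minimal sketch using only Base Julia; `ByteStack`, `checkout!` and `checkin!` are hypothetical names chosen for illustration and are not part of the package's API.

```julia
# Hypothetical stack of reusable byte buffers (Base only).
struct ByteStack
    free::Vector{Vector{UInt8}}
end
ByteStack() = ByteStack(Vector{UInt8}[])

# Take a buffer from the stack, or create a new one if the stack is empty.
function checkout!(s::ByteStack, nbytes::Integer)
    buf = isempty(s.free) ? UInt8[] : pop!(s.free)
    resize!(buf, nbytes)
    return buf
end

# Return a buffer to the stack so that a later checkout! can reuse it.
checkin!(s::ByteStack, buf::Vector{UInt8}) = (push!(s.free, buf); nothing)

s = ByteStack()
a = checkout!(s, 80)
checkin!(s, a)
b = checkout!(s, 40)   # reuses the buffer that was just returned
```

A buffer that is never passed to `checkin!` is simply collected by the garbage collector once it is unreachable, which mirrors the behaviour described above.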
One can also use the `unwrap` function to get the `PtrArray` of a `FlexCachedArray`, or of the adjoint/transpose of a `FlexCachedArray`:

```julia
cache = FlexArrayCache()
A = acquire!(cache, (M, N), Float64)
Aptr = unwrap(A)     # PtrArray of size (M, N)
At = A'              # Adjoint of A in FlexCachedArray
Atptr = unwrap(At)   # PtrArray of At with size (N, M)
```

Warning: Using `parent` to obtain the `PtrArray` of a `FlexCachedArray` is deprecated. Always use `unwrap` instead.
A pool is a dictionary of temporary arrays or array caches indexed by symbols. It enables the management of many temporary arrays (or caches) within a single field. For example,

```julia
pool = ArrayPool(FlexArray)
A = acquire!(pool, :A, (10, 10), Float64)
B = acquire!(pool, :B, (10, 100), ComplexF64)
```

One can similarly create an `ArrayPool(FlexArrayCache)`.
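
The symbol-indexed lookup can be sketched as a dictionary that creates a buffer the first time a key is requested. This is an illustration in Base Julia only; `BytePool` and `buffer!` are hypothetical names, and the package's own pool may be organized differently.

```julia
# Hypothetical symbol-indexed pool of byte buffers (Base only).
struct BytePool
    buffers::Dict{Symbol, Vector{UInt8}}
end
BytePool() = BytePool(Dict{Symbol, Vector{UInt8}}())

# Look up the buffer stored under key, creating it on first use,
# and grow it if it is smaller than the requested number of bytes.
function buffer!(pool::BytePool, key::Symbol, nbytes::Integer)
    buf = get!(() -> UInt8[], pool.buffers, key)
    length(buf) < nbytes && resize!(buf, nbytes)
    return buf
end

pool = BytePool()
a  = buffer!(pool, :A, 800)
b  = buffer!(pool, :B, 16)
a2 = buffer!(pool, :A, 100)   # same buffer as a; it is not shrunk
```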
In multi-threaded code it can become important that each thread uses its own temporary work array. This can be achieved by wrapping a `FlexArray`, a `FlexArrayCache`, or an `ArrayPool` in a `TSafe`, e.g.

```julia
tmp = TSafe(ArrayPool(FlexArrayCache))
```

We can now access this as follows:

```julia
@threads for n = 1:N
    A = acquire!(tmp, :A, (10, 10), SVector{3, Float64})
    # do something with A
    release!(A)
end
```

Here, `tmp` actually stores a separate `ArrayPool` for each thread. Note that, because of the dynamic scheduler, an array `A` may be acquired in thread `i` and released in thread `j`, in which case it is released back to a different stack.
Note that, because of the dynamic scheduler, `TSafe(FlexArray)` is NOT entirely thread-safe. These arrays are thread-safe only when used with the static scheduler, e.g.

```julia
tmp = TSafe(FlexArray)
@threads :static for i = 1:10
    A = acquire!(tmp, (20, 30, 5), ComplexF32)
    # do something with A
end
```
The following example is only indicative; the code is not intended to run as written.

The simplest use case of `ObjectPools.jl` is to have flexible temporary variables and output arrays that can be reused. For example, suppose we want to evaluate the spherical harmonics. This could be implemented as follows:
```julia
struct Ylms
    L::Int
    tmpP::FlexArray
    outY::FlexArrayCache
end

Ylms(L::Integer) = Ylms(L, FlexArray(), FlexArrayCache())

function (ylms::Ylms)(r::SVector{3, T}) where {T <: Real}
    L = ylms.L
    P = acquire!(ylms.tmpP, (lenP(L),), T)
    eval_alp!(P, r)   # not shown
    Y = acquire!(ylms.outY, (lenY(L),), Complex{T})
    eval_ylm!(Y, P, r)   # not shown
    return Y
end

ylms = Ylms(L)

for i = 1:niter
    r = @SVector randn(3)   # generate an input somehow
    Y = ylms(r)             # evaluate the Ylms
    # ... do something with Y
    release!(Y)             # return it to the pool
end
```
The first advantage of the above implementation is that the input type parameter `T` need not be known until runtime. For example, we can now use `ForwardDiff` to differentiate the basis, and the `FlexArray`s will simply become arrays of `Dual` numbers.

The second advantage is that the output array is released back to the array cache rather than newly allocated at each step. Of course, one could instead pre-allocate and write an in-place version of the evaluation code. But this requires type management outside of the `Ylms` implementation, which can get tedious. The `FlexArrayCache` is a simple mechanism to keep all type management localized to the actual implementation.

If we wanted to make the `for i = 1:niter` loop multi-threaded, we could rewrite this code as follows:
```julia
struct Ylms
    L::Int
    tmpP::TSafe{FlexArray}
    outY::TSafe{FlexArrayCache}
end

ylms = Ylms(L)

@threads :static for i = 1:niter
    r = @SVector randn(3)   # generate an input somehow
    Y = ylms(r)             # evaluate the Ylms
    # ... do something with Y
    release!(Y)             # return it to the pool
end
```
We use the static scheduler because `TSafe{FlexArray}` is not safe to use with the dynamic scheduler.
To use the dynamic scheduler, we need to swap it for a `TSafe{FlexArrayCache}`:
```julia
struct Ylms
    L::Int
    tmpP::TSafe{FlexArrayCache}
    outY::TSafe{FlexArrayCache}
end

ylms = Ylms(L)

@threads for i = 1:niter
    r = @SVector randn(3)   # generate an input somehow
    Y = ylms(r)             # evaluate the Ylms
    # ... do something with Y
    release!(Y)             # return it to the pool
end
```