Linear and logarithmic quantization for Julia arrays
Author milankl
1 Star
Updated Last
1 Year Ago
Started In
March 2021



Linear and logarithmic quantisation for Julia arrays into 8, 16, 24 or 32-bit. Quantisation is a lossy compression method that divides the range of values in an array in equi-distant quantums and encodes those from 0 to 2^n-1 where n is the number of bits available. The quantums are either equi-distant in linear space or in logarithmic space, which has a denser encoding for values close to the minimum in trade-off with a less dense encoding close to the maximum.

Linear quantization takes values in (-∞,∞) (no NaN or Inf) logarithmic quantization is only supported for values in [0,∞).

Usage: Linear quantization

Linear quantisation of n-dimensional arrays (any number format that can be converted to Float64 is supported, including Float32, Float16) into 8, 16, 24 or 32 bit is achieved via

julia> A = rand(Float32,1000)
julia> L = LinQuant8Array(A)
1000-element LinQuantArray{UInt8,1}:

and similarly with LinQuant16Array, LinQuant24Array, LinQuant32Array. Decompression via

julia> Array(L)
1000-element Array{Float32,1}:

Array{T}() optionally takes a type parameter T such that decompression to other number formats than the default Float32 is possible involves a rounding error which follows a round-to-nearest in linear space.

Logarithmic quantisation

In a similar way, LogQuant8Array, LogQuant16Array, LogQuant24Array, LogQuant32Array compresses an n-dimensional array (non-negative elements only) via logarithmic quantisation.

julia> A = rand(Float32,100,100)
julia> A[1,1] = 0
julia> L = LogQuant16Array(A)
100×100 LogQuantArray{UInt16,2}:
 0x0000  0xf22d  0xfdf6  0xf3e8  0xf7750xe3dc  0xfdc0  0xedb5  0xed47  0xee5b     
 0xde3d  0xbe58  0xb541  0xf573  0x9885     
 0xf38b  0xfefe  0xea2f  0xfbb6  0xf0d2     
 0xd0d2  0xfe1f  0xff60  0xf6cd  0xec26        
 0xffa6  0xe621  0xf14d  0xfb2c  0xf50c0xfcb7  0xe6fb  0xf237  0xecd5  0xfb0a     
 0xe4ed  0xf86f  0xf83d  0xff86  0xb686     
      ⋮                                  ⋱

Exception occurs for 0, which is mapped to 0x0. Ox1 to 0xff...ff are then the available bitpatterns to encode the range from minimum(A) to maximum(A) logarithmically. By default the rounding mode for logarithmic quantisation is round-to-nearest in linear space. Alternatively, a second argument can be either :linspace or :logspace, which allows for round-to-nearest in logarithmic space. Decompression as with linear quantisation via the Array() function.


To compress an array A, the minimum and maximum is obtained

Amin = minimum(A)
Amax = maximum(A)

which allows the calculation of Δ, the inverse of the spacing between two quantums

Δ = 2^(n-1)/(Amax-Amin)

where n is the number of bits used for quantisation. For every element a in A the corresponding quantum q which is closest in linear space is calculated via

q = T(round((a-Amin)*Δ))

where round is the round-to-nearest function for integers and T the conversion function to 24-bit unsigned integers UInt24 (or UInt8, UInt16 for other choices of n). Consequently, an array of all q and Amin,Amax have to be stored to allow for decompression, which is obtained by reversing the conversion from a to q. Note that the rounding error is introduced as the round function cannot be inverted.

Logarithmic quantisation distributes the quantums logarithmically, such that more bitpatterns are reserved for values close to the minimum and fewer close to the maximum in A. Logarithmic quantisation can be generalised to negative values by introducing a sign-bit, however, we limit our application here to non-negative values. We obtain the minimum and maximum value in A as follows

Alogmin = log(minpos(A))
Alogmax = log(maximum(A))

where zeros are ignored in the minpos function, which instead returns the smallest positive value. The inverse spacing Δ is then

Δ = 2^(n-2)/(logmax-logmin)

Note, that only 2^(n-1) (and not 2^n as for linear quantisation) bitpatterns are used to resolve the range between minimum and maximum, as we want to reserve the bitpattern 0x000000 for zero. The corresponding quantum q for a A is then

q = T(round(c + Δ*log(a)))+0x1

unless a=0 in which case q=0x000000. The constant c can be set as -Alogmin*Δ such that we obtain essentially the same compression function as for linear quantisation, except that every element a in A is converted to their logarithm first. However, rounding to nearest in logarithmic space will therefore be achieved, which is a biased rounding mode, that has a bias away from zero. We can correct this round-to-nearest in logarithmic space rounding mode with

c = 1/2 - Δ*log(minimum(A)*(exp(1/Δ)+1)/2)

which yields round-to-nearest in linear space. See next section.

Round to nearest in linear or logarithmic space

For a logarithmic integer system with base b (i.e. only 0,b,b²,b³,... are representable), for example, we have

log_b(1) = 0
log_b(b) = 0.5
log_b(b) = 1
log_b(b³) = 1.5
log_b(b²) = 2

such that q*√b is always halfway between two representable numbers q,q2 in logarithmic space, which will be the threshold for round up or down in the round function. q*√b is not halfway in linear space, which is always at q + (q*b - q)/2. For simplicity we can set q=1, and for b=2 we find that

2 = 1.41... != 1.5 = 1 + (2-1)/2

Round-to-nearest in log-space therefore rounds the values between 1.41... and 1.5 to 2, which will introduce an away-from-zero bias. As halfway in log-space is reached by multiplication with √b, this can be corrected to halfway in linear space by adding a constant c_b in log-space, such that conversion from halfway in linear space, i.e. 1+(b-1)/2 should yield halway in log-space, i.e. 0.5

c_b + log_b(1+(b-1)/2) = 0.5

So, for b=2 we have c_b = 0.5 - log2(1.5) ≈ -0.085. Hence, a small number will be subtracted before rounding is applied to reduce the away-from-zero bias.

Figure A1. Schematic to illustrate round-to-nearest in linear vs logarithmic space for logarithmic number systems.

We now generalise the logarithmic system, such that the distance dlog = 1/Δ between two representable numbers (i.e. quantums) is not necessarily 1 (in log-space) and we allow for an offset as done in the logarithmic quantisation. Let min be the offset (i.e. the minimum of the uncompressed array) and dlin the spacing between the first two representable quantums min,q2. Then the logarithm of halfway in linear space, log_b(min + dlin/2), should map to 0.5.

c_b + (log_b(min + dlin/2) - log_b(min))/dlog = 0.5

With dlin = b^(log_b(min) + dlog) - min this can be transformed into

c_b = 1/2 - 1/dlog*log_b((b^dlog + 1)/2)

and combined with the offset correction -log_b(min)*Δ to form either

c = -log(min)*Δ,   (round-to-nearest in log-space)
c = 1/2 - Δ*log(minimum(A)*(exp(1/Δ)+1)/2)    (round-to-nearest in linear-space)

with b = ℯ, so that only the natural logarithm has to be computed for every element in the uncompressed array.


Approximate throughputs are (via @btime)

Method 8 bit 16 bit 24 bit 32 bit
compression 1350 MB/s 1350 MB/s 50 MB/s 1350 MB/s
decompression 4700 MB/s 4700 MB/s 4000 MB/s 3600 MB/s
compression 285 MB/s 285 MB/s 40 MB/s 285 MB/s
decompression 250 MB/s 250 MB/s 250 MB/s 500 MB/s

24-bit quantisation is via UInt24 from the BitIntegers package, which introduces a drastic slow-down.


LinLogQuantization.jl is registered, so just do

julia>] add LinLogQuantization

where ] opens the package manager

Used By Packages

No packages found.