A number format that you can count with your fingers.
Author: milankl
Started in February 2020

Project Status: Active – The project has reached a stable, usable state and is being actively developed.


Finally a number type that you can count with your fingers. Super Mario and Zelda would be proud.

Comes in two flavours: Float8 has 3 exponent bits and 4 fraction bits, Float8_4 has 4 exponent bits and 3 fraction bits. Both rely on conversion to Float32 to perform any arithmetic operation, similar to Float16.
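To make the trade-off between the two flavours concrete, here is a minimal sketch that uses only Base Julia. The helper `format_summary` is hypothetical and not part of Float8s.jl. It assumes an IEEE-style layout (one sign bit, exponent bias `2^(ebits-1) - 1`, all-ones exponent reserved for Inf/NaN), which this README does not spell out.

```julia
# Hypothetical helper, not part of Float8s.jl: summarise a small float format
# with `ebits` exponent bits and `fbits` fraction bits.
# ASSUMPTION: IEEE-style layout, i.e. bias 2^(ebits-1) - 1 and the all-ones
# exponent reserved for Inf/NaN.
function format_summary(ebits::Int, fbits::Int)
    bias = 2^(ebits - 1) - 1
    emax = 2^ebits - 2 - bias                  # largest normal exponent
    emin = 1 - bias                            # smallest normal exponent
    largest = (2.0 - 2.0^(-fbits)) * 2.0^emax  # largest finite value
    smallest_normal = 2.0^emin
    smallest_subnormal = 2.0^(emin - fbits)
    return (bias = bias, largest = largest,
            smallest_normal = smallest_normal,
            smallest_subnormal = smallest_subnormal)
end

format_summary(3, 4)   # layout described for Float8
format_summary(4, 3)   # layout described for Float8_4
```

Under that assumption, the extra exponent bit of Float8_4 buys a wider range at the cost of one bit of precision. Use floatmax and floatmin from the package itself to check the actual values.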

CAUTION: Float8_4(::Float32) currently contains a bug for subnormals.

Example use

julia> using Float8s

julia> a = Float8(4)

julia> b = Float8(3.14159)

julia> a+b

julia> sqrt(a)

julia> a^2

Most arithmetic operations are implemented. If you would like to have an additional feature, raise an issue.


Float8s.jl is registered, so to install it just do

(v1.3) pkg> add Float8s


Conversions from Float8 to Float32 (the same holds for Float8_4) take about 1.5ns per element, which is about 2x faster than conversions from Float16 to Float32, thanks to table lookups.

julia> using Float8s, BenchmarkTools
julia> A = Float8.(randn(300,300));
julia> B = Float16.(randn(300,300));
julia> @btime Float32.($A);
  133.060 μs (2 allocations: 351.64 KiB)
julia> @btime Float32.($B);
  232.701 μs (2 allocations: 351.64 KiB)
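The table-lookup idea behind the fast Float8 to Float32 conversion can be sketched as follows. This is an illustration in Base Julia, not the Float8s.jl source: the names `decode_float8`, `FLOAT8_TABLE` and `lookup_float32` are hypothetical, and the bit layout (1 sign bit, 3 exponent bits with bias 3, 4 fraction bits, IEEE-style subnormals and Inf/NaN) is an assumption.

```julia
# Hypothetical sketch, not the Float8s.jl implementation.
# ASSUMPTION: IEEE-style layout, 1 sign | 3 exponent (bias 3) | 4 fraction bits.
function decode_float8(bits::UInt8)
    sign = (bits & 0x80) == 0x00 ? 1f0 : -1f0
    e = Int((bits & 0x70) >> 4)      # 3 exponent bits
    f = Int(bits & 0x0f)             # 4 fraction bits
    if e == 7                        # all-ones exponent: Inf or NaN
        return f == 0 ? sign * Inf32 : NaN32
    elseif e == 0                    # subnormal: no implicit leading one
        return sign * (Float32(f) / 16f0) * 2f0^(1 - 3)
    else                             # normal number
        return sign * (1f0 + Float32(f) / 16f0) * 2f0^(e - 3)
    end
end

# Decode all 256 bit patterns once; a conversion is then a single array read.
const FLOAT8_TABLE = Float32[decode_float8(UInt8(i)) for i in 0:255]

# Julia arrays are 1-based, so bit pattern 0x00 lives at index 1.
lookup_float32(bits::UInt8) = @inbounds FLOAT8_TABLE[Int(bits) + 1]
```

With only 256 possible bit patterns the whole table takes 1 KiB, so each conversion costs a single indexed load instead of bit manipulation. That is plausibly why this direction is cheap, while Float32 to Float8 still has to round.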

Conversions in the other direction, from Float32 to Float8, are about 6-7x slower than from Float8 to Float32, and slightly slower than conversions from Float32 to Float16.

julia> C = Float32.(randn(300,300));
julia> @btime Float16.($C);
  672.419 μs (2 allocations: 175.89 KiB)
julia> @btime Float8.($C);
  873.585 μs (2 allocations: 88.02 KiB)