BatchedBLAS.jl

Nvidia only provides batched versions of the BLAS gemv, gemm, and trsm functions. Further, they only support floating point, and alpha and beta can only be scalars. BatchedBLAS.jl extends support for batched arrays by (currently) providing batched versions of dot, gemv, symv, spmv, ger, syr, and spr that work with arrays of AbstractFloats and Integers, and scaling coefficients which can be scalars or Vectors.

In addition to the type flexibility, there is a performance benefit for rank-1 updates as execution times for ger, syr, and spr are faster than the equivalent batched gemm for the range of parameters tested. dot is also faster for small matrices. Benchmarks on an H100 follow. The dashed lines are for the transposed version of gemv and the upper-triangle versions of all other functions. Lower numbers are better.

Example usage:

julia> using CUDA, SymmetricFormats, BatchedBLAS

julia> L = 4  # the matrix size
4

julia> N = 6  # the batch dimension
6

julia> A = reshape(1:L*L*N, L, L, N)
4×4×6 reshape(::UnitRange{Int64}, 4, 4, 6) with eltype Int64:
[:, :, 1] =
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

[:, :, 2] =
 17  21  25  29
 18  22  26  30
 19  23  27  31
 20  24  28  32

[:, :, 3] =
 33  37  41  45
 34  38  42  46
 35  39  43  47
 36  40  44  48

[:, :, 4] =
 49  53  57  61
 50  54  58  62
 51  55  59  63
 52  56  60  64

[:, :, 5] =
 65  69  73  77
 66  70  74  78
 67  71  75  79
 68  72  76  80

[:, :, 6] =
 81  85  89  93
 82  86  90  94
 83  87  91  95
 84  88  92  96

julia> SP = CuArray{Float64}(undef, packedsize(A), N);

julia> for i in 1:N
           SP[:,i] = SymmetricPacked(view(A,:,:,i), :U).tri
       end

julia> SP
10×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  1.0  17.0  33.0  49.0  65.0  81.0
  5.0  21.0  37.0  53.0  69.0  85.0
  6.0  22.0  38.0  54.0  70.0  86.0
  9.0  25.0  41.0  57.0  73.0  89.0
 10.0  26.0  42.0  58.0  74.0  90.0
 11.0  27.0  43.0  59.0  75.0  91.0
 13.0  29.0  45.0  61.0  77.0  93.0
 14.0  30.0  46.0  62.0  78.0  94.0
 15.0  31.0  47.0  63.0  79.0  95.0
 16.0  32.0  48.0  64.0  80.0  96.0

julia> x = CuArray{Float64}(reshape(1:L*N, L, N))
4×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0  5.0   9.0  13.0  17.0  21.0
 2.0  6.0  10.0  14.0  18.0  22.0
 3.0  7.0  11.0  15.0  19.0  23.0
 4.0  8.0  12.0  16.0  20.0  24.0

julia> batched_spr!('U', 1.0, x, SP)

julia> SP
10×6 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
  2.0  42.0  114.0  218.0  354.0  522.0
  7.0  51.0  127.0  235.0  375.0  547.0
 10.0  58.0  138.0  250.0  394.0  570.0
 12.0  60.0  140.0  252.0  396.0  572.0
 16.0  68.0  152.0  268.0  416.0  596.0
 20.0  76.0  164.0  284.0  436.0  620.0
 17.0  69.0  153.0  269.0  417.0  597.0
 22.0  78.0  166.0  286.0  438.0  622.0
 27.0  87.0  179.0  303.0  459.0  647.0
 32.0  96.0  192.0  320.0  480.0  672.0

julia> SymmetricPacked(SP[:,1])
4×4 SymmetricPacked{Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}:
  2.0   7.0  12.0  17.0
  7.0  10.0  16.0  22.0
 12.0  16.0  20.0  27.0
 17.0  22.0  27.0  32.0

julia> SymmetricPacked(A[:,:,1] .+ Array(x[:,1]*transpose(x[:,1])))
4×4 SymmetricPacked{Float64, Matrix{Float64}}:
  2.0   7.0  12.0  17.0
  7.0  10.0  16.0  22.0
 12.0  16.0  20.0  27.0
 17.0  22.0  27.0  32.0

BatchedBLAS.jl

Required Packages

Used By Packages

Suggest Category

Required Packages

Used By Packages

Julia Packages