The abstractness (or lack thereof) of fallback implementations of array operations in Base Julia left me unsatisfied. Some of the problems we have:

- Fallback methods of array operations assume the presence of efficient elementwise access (`getindex`), either `IndexLinear` or `IndexCartesian`. It is hard to reconcile this abstraction with distributed / blocked arrays, so DArray implementations end up wrapping every operation on AbstractArray. Some of these wrappers are trivial (e.g. `map` and `reduce`), but others are not (e.g. `reducedim`, `broadcast`). Further, the wrappers need to be constantly updated to stay in sync with Base's set of features across Julia versions.
- Operations often involve similar boilerplate code for dimensionality checks and for reflection to find the output array's type and dimensions.
- Not all operations are optimized for memory locality, and those that are have separate implementations, leading to more complex code that strictly need not exist.

This package aims to evolve the means available to express array operations at a higher level than is currently possible. The basic idea is that if you get the general `@arrayop` case working for a new array type, then implementations of many array operations fall out of it (see the next section to learn about the `@arrayop` macro). `@arrayop` is also higher-level than elementwise access, so distributed arrays can implement it efficiently.

Hypothetically, if the `@arrayop` macro were moved to Base, and array operations in Base like `broadcast` and `reducedim` were implemented using it, then:

- We could delete a lot of array code from `Base` and replace it with much simpler `@arrayop` expressions.
- Complex array types like `DArray` wouldn't have to wrap each operation; they would just need to get `@arrayop` working once, and the operations defined in Base would work for that array type. (This package has a Dagger.jl array implementation as an example of this.)
- Optimizations (like multi-threading and memory locality) could speed up many operations at once across the whole array ecosystem.

The `@arrayop` macro expresses array operations by denoting how the dimensions of the input arrays interact to produce the dimensions of the output array. The notation is similar to Einstein notation (or its equivalent in TensorOperations.jl; see below for a comparison), with some added features to support more operations. The notation is best described by some examples:

```
X = collect(reshape(1:12, 4,3))
Y = ones(3,4)
# transpose (equivalent to X')
@arrayop Z[i,j] := X[j,i]
# elementwise 1-arg (equivalent to sin.(X))
@arrayop Z[i,j] := sin(X[i,j])
# elementwise 2-args (equivalent to X .+ Y')
@arrayop Z[i,j] := X[i,j] + Y[j,i]
# elementwise with a constant (equivalent to X .+ (im .* Y'))
@arrayop Z[i,j] := X[i,j] + im * Y[j,i]
# reduce (default +-reduce). Note: returns a 0-dimensional array
# NOTE: any dimension that is left out on the LHS of the expression
# is reduced, as in Einstein notation. By default + is used as the reducer.
@arrayop Z[] := X[i,j]
# reduce with a user-specified function, * in this case.
@arrayop Z[] := X[i,j] (*)
# reducedim (default reducer is +)
@arrayop Z[1, j] := X[i,j] # equivalent to reducedim(+, X, 1)
@arrayop Z[i, 1] := X[i,j] # equivalent to reducedim(+, X, 2)
@arrayop Z[i] := X[i,j] # equivalent to squeeze(reducedim(+, X, 2)) / APL-style reducedim
# reducedim with a user-specified function
@arrayop Z[1, j] := X[i,j] (*) # equivalent to prod(X, 1)
# broadcast
y = [1, 2, 3, 4]
@arrayop Z[i, j] := X[i, j] + y[i]
y = [1 2 3]
@arrayop Z[i, j] := X[i, j] + y[1, j]
# matmul
@arrayop Z[i, j] := X[i, k] * Y[k, j]
```

Note that these expressions generate the `for` loops needed to perform the required operation. They can hence be used to implement the array operations noted in the comments.
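For instance, the matmul example above corresponds to a loop nest like the following hand-written sketch (an illustration of what such an expression computes, not the macro's literal output; `matmul_loops` is a name invented here):

```julia
# Sketch of the loop nest behind `@arrayop Z[i,j] := X[i,k] * Y[k,j]`.
function matmul_loops(X, Y)
    # `:=` means the output is allocated
    Z = zeros(promote_type(eltype(X), eltype(Y)), size(X, 1), size(Y, 2))
    for i in axes(X, 1), j in axes(Y, 2)
        acc = zero(eltype(Z))   # identity of the default `+` reducer
        # `k` appears only on the RHS, so it is reduced over
        for k in axes(X, 2)
            acc += X[i, k] * Y[k, j]
        end
        Z[i, j] = acc
    end
    return Z
end

X = collect(reshape(1:12, 4, 3))
Y = ones(3, 4)
matmul_loops(X, Y) == X * Y   # true
```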

An expression like `@arrayop Z[i, j] = X[i,k] * Y[k,j]` works in place by overwriting `Z` (note that `=` is used instead of `:=`).

The same expressions currently work on all AbstractArrays and are specialized for efficiency on Dagger arrays.

The examples here use 1- and 2-dimensional arrays, but the notation trivially generalizes to N dimensions.

As an example of how this aids genericness, Base could potentially define the `reducedim` function on an n-dimensional array as:

```
@generated function reducedim(f, X::AbstractArray, ::Val{dim}) where {dim}
    idx_in = Any[Symbol("i$n") for n = 1:ndims(X)]
    idx_out = copy(idx_in)
    idx_out[dim] = 1
    :(@arrayop Y[$(idx_out...)] := X[$(idx_in...)] $f)
end
```

This allows `reducedim` to work efficiently both on AbstractArrays and, in a specialized way, on Dagger's arrays (or any other array type that has a specialized implementation of `@arrayop`).

`@arrayop` uses TiledIteration.jl to perform operations in a cache-efficient way. As a demo of this, consider a 3-dimensional `permutedims`:

```
julia> x = rand(128,128,128);
julia> @btime @arrayop y[i,j,k] := x[k,j,i];
4.871 ms (16 allocations: 16.00 MiB)
julia> @btime permutedims(x, (3,2,1));
23.611 ms (10 allocations: 16.00 MiB)
```

Presumably, Base's `permutedims` makes no effort to block the inputs, leading to many more cache misses than the `@arrayop` version.
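To illustrate the blocking idea on a simpler 2-dimensional case, here is a hand-rolled tiled transpose (the package itself uses TiledIteration.jl rather than manual tiling like this; `blocked_transpose!` and the tile size are illustrative choices):

```julia
# Tiled 2-D transpose: iterate over small tiles so that both the read
# from `x` and the write to `y` stay within cache-sized blocks.
function blocked_transpose!(y, x, tile = 64)
    n, m = size(x)
    for jj in 1:tile:m, ii in 1:tile:n
        for j in jj:min(jj + tile - 1, m), i in ii:min(ii + tile - 1, n)
            @inbounds y[j, i] = x[i, j]
        end
    end
    return y
end

x = rand(512, 512)
y = similar(x)
blocked_transpose!(y, x) == permutedims(x, (2, 1))   # true
```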

```
julia> A = rand(1000, 1000);
julia> @btime A+A';
3.752 ms (6 allocations: 15.26 MiB)
julia> @btime @arrayop _[i, j] := A[i,j] + A[j,i];
2.607 ms (22 allocations: 7.63 MiB)
```

This might open up opportunities to syntactically rewrite expressions like `A+A'` into `@arrayop _[i,j] = A[i,j] + A[j,i]`, which is faster and allocates no temporaries. This should also speed up operations on `PermutedDimsArray`.
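A hand-written version of that temporary-free rewrite looks like this (a sketch only; `addt!` is a name invented for illustration):

```julia
# Computes A + A' into a preallocated output, with no temporary arrays.
function addt!(out, A)
    for j in axes(A, 2), i in axes(A, 1)
        @inbounds out[i, j] = A[i, j] + A[j, i]
    end
    return out
end

A = rand(100, 100)
out = similar(A)
addt!(out, A) == A + A'   # true
```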

`@arrayop <expr>` simply translates to `arrayop!(@lower <expr>)`. The goal of `@lower` is to lower the expression to an `Assign` type, which contains an LHS and an RHS consisting of a combination of `Indexing` and `Map` types:

- `A[i, j]` becomes `Indexing(A, IndexSym{:i}(), IndexSym{:j}())`
- `A[i, 1]` becomes `Indexing(A, IndexSym{:i}(), IndexConst{Int}(1))`
- `f(A[i, j])` becomes `Map(f, <lowering of A[i,j]>)`
- `B[i] = f(A[i, j])` becomes `Assign(<lowering of B[i]>, <lowering of f(A[i,j])>, +, reduction_identity(+, eltype(A)))`, i.e. `Indexing(B, IndexSym{:i}())` on the LHS and the lowered `f(A[i,j])` on the RHS
- `B[i] := f(A[i, j])` creates a similar `Assign`, but its LHS contains `Indexing(AllocVar{:B}, IndexSym{:i}())`, denoting that an output array (named `B`) should be allocated and then iterated

You can try out a few expressions with the `@lower` macro to see how they get lowered. The types representing the lowered form are parameterized by the types of all their arguments, allowing functions in the next steps to dispatch on them and generate efficient code tailored to the specific expression and the specific array types.
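A minimal sketch of such fully parameterized lowering types, under simplified assumed definitions (these are not the package's actual definitions; for example, the real `Indexing` takes its index descriptors as trailing arguments rather than a tuple):

```julia
# Illustrative stand-ins for the lowered-form types. Everything about the
# expression (index names, constants, the mapped function) is encoded in
# the *type*, so later stages can dispatch on it and generate tailored code.
struct IndexSym{name} end        # a named index like `i`
struct IndexConst{T}             # a constant index like `1`
    val::T
end
struct Indexing{A, I<:Tuple}     # an indexing expression like A[i, 1]
    array::A
    idx::I
end
struct Map{F, T<:Tuple}          # a function applied to indexing expressions
    f::F
    args::T
end

A = rand(3, 4)
# lowering of sin(A[i, 1]):
lowered = Map(sin, (Indexing(A, (IndexSym{:i}(), IndexConst{Int}(1))),))
typeof(lowered)   # fully parameterized: carries sin, :i, and the const index
```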

The lowered object obtained from the expression in Step 1 is passed to `arrayop!`. By default, `arrayop!(op)` calls `arrayop!(<type of LHS array>, op)`, which is a way to dispatch to different implementations based on the type of the array on the LHS. This is how `@arrayop` is able to pick different implementations for normal arrays and Dagger arrays. `arrayop!(::Type{AllocVar}, op)` is special: it first allocates the LHS by calling `ArrayMeta.allocarray`, and then calls `arrayop!(<type of LHS>, op)`.
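The dispatch scheme can be sketched like so (simplified; `Lhs` and `Op` are stand-ins invented here for the real lowered types):

```julia
# Two-step dispatch: first on the lowered op, then on the LHS array type.
struct Lhs{A}
    array::A
end
struct Op{L, R}
    lhs::L
    rhs::R
end

# entry point: re-dispatch on the type of the LHS array
arrayop!(op::Op) = arrayop!(typeof(op.lhs.array), op)

# implementation for plain dense arrays; a distributed array package
# would add e.g. arrayop!(::Type{<:DArray}, op) to take over
arrayop!(::Type{<:Array}, op::Op) = "dense implementation"

op = Op(Lhs(zeros(2, 2)), nothing)
arrayop!(op)   # "dense implementation"
```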

The task of `arrayop!` is to act as a generated function which returns the code that can perform a given operation. The code for doing this on AbstractArrays is at `src/impl/abstract.jl`; the Dagger implementation is at `src/impl/dagger.jl`. It was possible to achieve the Dagger implementation without generating any loop expressions: interestingly, it works simply by rewriting the lowered form from Step 1 into a lowered form that acts on the chunks of the DArray, which can then be handled by the AbstractArray implementation.

- Dispatch to `BLAS.gemm!` where possible.
- Loop reordering (it gives some improvement even though we do blocked iteration)
- Optimizations for `PermutedDimsArray`
- Communication / computation time optimizations in Dagger, à la the Tensor Contraction Engine

- formalize interface for new array types to implement
- implementation for sparse matrices
- implementation for IndexedTable & AxisArrays
- Autodifferentiation
- Explore more operations to express with `@arrayop`:
  - mapslices
  - getindex
  - stencils

## Differences with TensorOperations.jl

Jutho's TensorOperations.jl has inspired this package a whole lot. However, its codebase is tailored specifically to tensors of real and complex numbers (their contraction, transposition, conjugation, and multiplication with scalars), and it does that very well. This package aims to cover all of those features in a more general framework. Notable additions:

- Works on arrays of any type
- You can use any Julia function for combining arrays and reducing dimensions, and any constants as arguments (as opposed to allowing only scalar multiplication or offsets)
- Supports indexing where some dimensions can be constants, as in `@arrayop y[1, j] := x[i, j]`, to support operations like `reducedim`.

- Finally, it has a dispatch system to pick different implementations for different array types. `Dagger` array operations have been implemented as an example.