Devectorize.jl

Devectorize -- A Julia Framework for De-vectorized Evaluation

Devectorize is a Julia framework, which provides macros and functions to de-vectorize a vector expression. With Devectorize, users can write computations in high-level vectorized way and at the same time enjoying the high run-time performance of de-vectorized loops. Devectorize automatically translates vectorized expressions into faster tight-loops, which often results in 2x - 8x performance gain.

Why Devectorize

In many programming languages (including Julia), expressions are immediately evaluated upon construction. This simple strategy often results in less than optimal behaviors, which, for example, include creation of unnecessary temporaries and repeated memory round-trips. Consider the following example,

r = a .* b + c .* d + a

With immediate evaluation, three temporaries, respectively for storing the results of a .* b, c .* d , and a .* b + c .* d. Also, the array a will be traversed twice. Moreover, computation on large arrays is often memory-bound -- the run-time performance largely depends on how many times you have to scan the arrays.

For the formula above, a much more efficient way to evaluate it can be expressed using for-loops as follows

n = length(a)
r = zeros(n)
for i = 1 : n
	r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

With this piece of code, you can get all the results in one pass, without creating any temporary arrays. However, low-level for-loops are often much longer and more difficult to read, write, and maintain.

Is it possible to combine the elegance of high-level expressions and the performance of low-level for-loops?

The answer is Yes. Let's look at the examples above, we can hold off the evaluation of all the temporaries until the assignment to r happens -- at this point, an integrated loop is emitted to compute all results in one pass.

The powerful meta-programming framework of Julia makes it possible to achieve this goal using incredibly simple syntax. Taking advantage of this framework, Devectorize provides a macro @devec:

@devec r = a .* b + c .* d + a

This statement is exactly the same as the one we saw above -- except for the macro @devec, which performs all the magic of translating the formula into a one-pass loop behind the scenes.

The remaining part is organized into two section: Basic Usage, which introduces how to use Devectorize to improve the performance of your code, and Design of the Framework, which provides a brief overview of the framework and its structures.

Basic Usage

You may install the package using Julia's official package manager:

Pkg.add("Devectorize")

To keep it always updated, you may have to switch Devectorize to the master branch, and git pull the latest commits.

For ordinary use, you only have to remember one macro -- @devec. Putting it before the assignments that you want to de-vectorize, it will automatically translate your expressions into efficient loops. For example, you can write

@devec r = exp(a + b) .* sum(c)

To inspect the code generated by Devectorize, you can use the @inspect_devec macro:

@inspect_devec r = exp(a + b) .* sum(c)

This statement will prints the generated codes (prior to evaluating them).

Benchmark

Here is a table of benchmark results on some typical cases.

	julia vec	@devec	hand-coded loop
simple-ewise	1.0000	2.6032x	2.5719x
complex-ewise	1.0000	2.4581x	2.4364x
shift-dot	1.0000	8.3237x	8.2959x
colwise-sum	1.0000	1.3321x	1.2771x
rowwise-sum	1.0000	4.2736x	4.2444x
colwise-eucdist	1.0000	5.6502x	5.5356x

The result was obtained with Julia commit 3f92b13210 (2013-02-03) on Mac OS X 10.8, using the script test/bench_devec.jl, which comes with the Devectorize package.

Here, we use vectorized Julia code as the baseline, and report the performance gains (for example, if the baseline takes 1 sec, and devec takes 0.5sec, then the gain is 2x). We can see that codes tagged with the @devec macro typically performs 2x to 5x faster than vectorized codes, and is comparable (sometimes even slightly faster than) a hand-coded for loop.

It is important to note that Devectorize only recognizes a subset of expressions of Julia (but the most commonly used subset), as listed below.

Element-wise map of numbers and arrays

@devec r = a + b + c
@devec r = sin(a) + exp(a + 1.0) .* log(c)

Here is the list of operators and functions currently supported by Devectorize:

+, -, .+, .-, .*, ./, .^, max, min, clamp, blend,
.==, .!=, .<, .>, .<=, .>=, 
sqrt, cbrt, sqr, rcp, floor, ceil, round, trunc,
exp, log, log10, exp2, log2, expm1, log1p, 
sin, cos, tan, asin, acos, atan, 
sinh, cosh, tanh, asinh, acosh, atanh,
erf, erfc, gamma, lgamma, digamma

Notes:

Operator * and / are not supported, as they entail complex semantics depending on the arguments which may only be known at run-time. Users can use .* and ./ to express element-wise multiplication and division, which are perfectly supported in Devectorize.
These three functions: sqr(x -> x * x), rcp(x -> 1 / x), and blend((c, x, y) -> c ? x : y) are not in the Base module of Julia. They are provided by Devectorize as extensions to make it easier to write vectorized expressions (and then @devec it).

Simple references

@devec r = x[:,1] + y[:,2]
@devec r = a[i,:] .* b
@devec r[:,j] = x + sin(a[:,j])

Simple reference here means the reference expressions in either of the following forms: ```a[:], a[i], a[i, j], a[:, :], a[:, i], a[i, :]``, where i can be either an integer or a symbol that refers to an integer variable. Reference expressions can appear in both left and right hand side of an assignment. Support of more flexible references is planned for future releases.

Note that when you write

r = a + b .* c

Devectorize will emit codes that creates an array to store the results and bound it to r, this process may entails some overhead of inferring the type and shape of the result and creating a new array. When r has been created, you may eliminate such runtime overheads by writing

r[:] = a + b .* c

Then, the results will be directly written to r, and no array will be created before evaluation.

Op-assignment

@devec r += a
@devec r[:,i] .*= sin(a)

Devectorize will automatically translate them into ordinary assignment expressions.

Full/Colwise/Rowwise reduction

@devec r = sum(a + b)
@devec r = maximum(sin(a), 1)
@devec r[:,j] = mean(a, 2)

Devectorize currently recognizes five reduction functions sum, maximum, minimum, mean, and dot.

Hybrid expressions

Consider the example below,

@devec r = (a - sum(a)) .* b

This seemingly simple expression actually requires two loops to evaluate, one for computing sum(a), and the other for the top-level element-wise expression. Devectorize recognizes such situations, and will emit correct codes to perform the evaluation. For the example above, Devectorize will first break the expression into two ones, as

@devec tmp1 = sum(a)
@devec r = (a - tmp1) .* b

Note that Devectorize only breaks expressions only when it is really necessary to do so, and tries to generate as few memory traversals as possible.

Block expressions

@devec begin
	a = sin(x) - cos(y)
	b = sum(a) + exp(z)
	c = x .* y - maximum(b)
end

In current implementation, Devectorize simply de-vectorizes each assignment expression respectively. In future version, it may use a more sophisticated algorithm to identify opportunities of sharing some computation across expressions.

Design of the Framework

In Devectorize, the process of translating a Julia expression into de-vectorized codes goes through two stages:

translate a Julia expression to a typed expression (enriched with semantic information), using texpr
compile the typed expression into de-vectorized codes, using compile, which itself takes three steps:
- decompose a given expression into a sequence of basic expressions (e.g. break a hybrid expression or a block expression)
- compose loops for each basic expression via a back-end factory, using compose
- integrate all generated loops into a code block and return

Typed expressions

Julia front-end parses any input expression into an instance of Expr, which contains only syntatic information but not semantic information. To generate the code, one has to first understand the semantics (i.e. meaning) of the expression, e.g. whether it is doing a reduction or a element-wise transformation.

To express the semantics of an expression, Devectorize establishes a type hierarchy in src/texpr.jl. The hierarchy can be briefly summarized as follows

TExpr
-- TEWise  	   # everything can serve as an element-wise argument
   -- TScalar  # everything that is sure to be a scalar 
      -- TNum  # numerical literals, e.g. 1, 2.0, ...
      -- TScalarSym     # a symbol that is known to be a scalar (e.g. result of a full reduction)
      -- TRefScalar1    # a[i]
      -- TRefScalar2    # a[i,j]
   -- TSym     # a symbol that refers to a variable (can be an array or a scalar)
   -- TRef
      -- TRef1D     # a[:]
      -- TRef2D     # a[:,:]
      -- TRefCol    # a[:,j]
      -- TRefRow    # a[i,:]
   -- TMap     # element-wise map, e.g. sin(a), a + b, a + b .* c, ...
-- TReduc           # full reduction, e.g. sum(a), sum(a + b .* c), ...
-- TColwiseReduc    # column-wise reduction
-- TRowwiseReduc    # row-wise reduction
-- TAssign     # asssignment, e.g. a = sin(x), r[:,i] = a + cos(x[:,j]), ...
-- TBlock      # a block of expressions

The function texpr (also defined in src/texpr.jl) takes an instance of Expr as an argument, analyzes it. If the expression is recognized, it returns a typed expression (i.e. an instance of TExpr), otherwise, it raises an error (to be more specific, throws an exception of type DeError.)

The analysis performed in texpr relies on the semantic information provided by the functions in src/fun_traits.jl. These functions can tell you sin is an element-wise mapping that takes one argument, while sum is a reduction. They also tell you result type information, e.g. the element type of a + b is promote_type(eltype(a), eltype(b)), while that for .== is Bool.

Contexts

To make the framework extensible, Devectorize introduces the notion of Context, which refers to a specific setting in which the codes are generated (e.g. CPU, SIMD, CUDA, OpenCL, etc)

The abstract type EvalContext (in src/compile_base.jl) is used as the super class of all contexts. In current version, Devectorize provides a specific context type, namely ScalarContext, in which expressions are mapped to de-vectorized for-loops.

In future, other contexts might be introduced (e.g. SIMD and CUDA), thus providing users options to choose specific ways to emit the evaluation code for their expressions.

Compilation

The function compile takes two arguments: a context and a typed expression, and returns a the generated codes. Generally, this function is a driver, which actually delegates the code generation to two functions: compose_init (for generating codes for initialization) and compose_main (for generating the main loops). These two functions are provided by specific back-ends.

To reduce the complexity of back-end implementation, the compile function performs some preprocessing, which includes

translates blocks and hybrid expressions into a sequence of basic expressions
identifies trivial assignments (i.e. a = b), and simply emits it (as a = b). Note that this simply bounds the name a to the object referred by b, which does not involve any real computation.
take precautions to prevent potential alias problems. For example, it translates a = b + sin(a) into two statements, tmp = b + sin(a), and a trivial assignment a = tmp. The temporary name is generated using gensym to avoid collision with other names.

After this processing, the back-end can be implemented in a much simpler way, without taking into account such intricacies.

The functions to generate codes for ScalarContext are in src/scalar_backend.jl.

Code composition

The routines in src/scalar_backend.jl uses recursive kernel composition to generate loop kernels.

Take the expression a + b .* c for example. It first generates get_value(a, i), get_value(b, i), and get_value(c, i) for the terminals a, b, and c. Here, get_value is an overloaded function to ensure correct behavior for different cases (e.g. a can be either a scalar or an array).

For b .* c, it takes the generated kernel for b and c (as above), combines them with the operator .*, and then emits .*(get_value(b, i), get_value(c, i). Likewise, for a + b .* c, it emits +(get_value(a, i), .*(get_value(b, i), get_value(c, i)).

The compose_main function will generate a loop that uses the generated kernel as the loop body.