NeuralProcesses.jl
NeuralProcesses.jl is a framework for composing Neural Processes built on top of Flux.jl.
NeuralProcesses.jl was presented at JuliaCon 2020 [link to video (7:41)].
See the Neural Process Family for code to reproduce the image experiments from Convolutional Conditional Neural Processes (Gordon et al., 2020).
Contents:
Introduction
The setting of NeuralProcesses.jl is metalearning. Metalearning is concerned with learning a map from data sets directly to predictive distributions. Neural processes are a powerful class of parametrisations of this map based on an encoding of the data.
train.jl
Predefined Experimental Setup: Eager to get started?!
The file train.jl
contains an predefined experimental setup that gets
you going immediately!
Example:
$ julia project=. train.jl model convcnp data matern52 loss loglik epochs 20
Here's what it can do:
usage: train.jl data DATA model MODEL [numsamples NUMSAMPLES]
[batchsize BATCHSIZE] loss LOSS
[startingepoch STARTINGEPOCH] [epochs EPOCHS]
[evaluate] [evaluateiw] [evaluatenoiw]
[evaluatenumsamples EVALUATENUMSAMPLES]
[evaluateonlywithin] [modelsdir MODELSDIR]
[bson BSON] [h]
optional arguments:
data DATA Data set: eqsmall, eq, matern52, eqmixture,
noisymixture, weaklyperiodic, sawtooth, or
mixture. Append "noisy" to a data set to make
it noisy.
model MODEL Model: conv[c]np, corconvcnp, a[c]np, or
[c]np. Append "global{mean,sum}" to
introduce a global latent variable. Append
"amortised{mean,sum}" to use amortised
observation noise. Append "het" to use
heterogeneous observation noise.
numsamples NUMSAMPLES
Number of samples to estimate the training
loss. Defaults to 20 for "loglik" and 5 for
"elbo". (type: Int64)
batchsize BATCHSIZE
Batch size. (type: Int64, default: 16)
loss LOSS Loss: loglik, loglikiw, or elbo.
startingepoch STARTINGEPOCH
Set to a number greater than one to continue
training. (type: Int64, default: 1)
epochs EPOCHS Number of epochs to training for. (type:
Int64, default: 20)
evaluate Evaluate model.
evaluateiw Force to use importance weighting for the
evaluation objective.
evaluatenoiw Force to NOT use importance weighting for the
evaluation objective.
evaluatenumsamples EVALUATENUMSAMPLES
Number of samples to estimate the evaluation
loss. (type: Int64, default: 4096)
evaluateonlywithin
Evaluate with only the task of interpolation
within training range.
modelsdir MODELSDIR
Directory to store models in. (default:
"models")
bson BSON Directly specify the file to save the model to
and load it from.
h, help show this help message and exit
Manual
Principles
Models
In NeuralProcesses.jl, models consists of an encoder and a decoder.
model = Model(encoder, decoder)
An encoder takes in the data and produces an abstract representation of the data. A decoder then takes in this representation and produces a prediction at target inputs.
Functional Representations and Coding
In the package, the three objects โ the data, encoding, and prediction โ
have a common representation. In particular, everything is represented as a
function: a tuple (x, y)
that corresponds to the function f(x[i]) = y[i]
for all indices i
.
Encoding and decoding, which we will collectively call coding, then become
transformations of functions.
Coding is implemented by the function code
:
xz, z = code(encoder, xc, yc, xt)
Here encoder
transforms the function (xc, yc)
, the context set, into
another function (xz, z)
, the abstract representation. The target inputs
xt
express the desire that encoder
should (not must) output a function
that maps from xt
.
If indeed xz == xt
, then the coding operation is called complete.
If, on the other hand, xz != xt
, then the coding operation is called
partial.
Encoders and decoders are complete coders.
However, encoders and decoders are often composed from simpler coders, and these
coders could be partial.
Compositional Coder Design
Coders can be composed using Chain
:
encoder = Chain(
coder1,
coder2,
coder3
)
Coders can also be put in parallel using Parallel
.
For example, this is useful if an encoder should output multiple encodings,
e.g. in a multiheaded architecture:
encoder = Chain(
...,
# Split the output into two parts, which can then be processed by two heads.
Splitter(...),
Parallel(
..., # Head 1
... # Head 2
)
)
By default, parallel representations are combined with concatenation along the
channel dimension (see Materialise
in the section below), but this is
readily extended to additional designs (see ?Materialise
).
Coder Likelihoods
A coder should output either a deterministic coding or a stochastic coding, which can be achieved by appending a likelihood:
deterministic_coder = Chain(
...,
DeterministicLikelihood()
)
stochastic_coder = Chain(
...,
HeterogeneousGaussianLikelihood()
)
When a model is run, the output of the encoder is sampled.
The resulting sample is then fed to the decoder.
In scenarios where the encoder outputs multiple encodings in parallel, it may be
desirable to concatenate those encodings into one big tensor which can then
be processed by the decoder.
This is achieved by prepending Materialise()
to the decoder:
decoder = Chain(
Materialise(),
...,
HeterogenousGaussian()
)
The decoder outputs the prediction for the data.
In the above example, decoder
produces means and variances at test inputs.
Available Models for 1D Regression
The package exports constructors for a number of architectures from the literature.
Name  Constructor  Reference 

Conditional Neural Process  cnp_1d 
Garnelo, Rosenbaum, et al. (2018) 
Neural Process  np_1d 
Garnelo, Schwarz, et al. (2018) 
Attentive Conditional Neural Process  acnp_1d 
Kim, Mnih, et al. (2019) 
Attentive Neural Process  anp_1d 
Kim, Mnih, et al. (2019) 
Convolutional Conditional Neural Process  convcnp_1d 
Gordon, Bruinsma, et al. (2020) 
Convolutional Neural Process  convnp_1d 
Foong, Bruinsma, et al. (2020) 
Gaussian Neural Process  corconvcnp_1d 
Bruinsma, Requeima et al. (2021) 
Download links for pretrained models are below. The instructions for how a pretrained model can be run are as follows:

Download some pretrained models.

Extract the models:
$ tar xzvf models.tar.gz
 Open Julia and load the model:
using NeuralProcesses, NeuralProcesses.Experiment, Flux
convnp = best_model("models/convnphet/loglik/matern52.bson")
 Run the model:
means, lowers, uppers, samples = predict(
convnp,
randn(Float32, 10), # Random context inputs
randn(Float32, 10), # Random context outputs
randn(Float32, 10) # Random target inputs
)
Foong, Bruinsma, et al. (2020)
Pretrained Models forInterpolation results for the pretrained models are as follows:
loglik
Model  eq 
matern52 
noisymixture 
weaklyperiodic 
sawtooth 
mixture 

cnp 
1.09  1.17  1.28  1.34  0.16  1.17 
acnp 
0.83  0.93  1.00  1.29  0.17  1.09 
convcnp 
0.69  0.88  0.93  1.19  1.09  0.94 
nphet 
0.76  0.90  0.94  1.23  0.16  0.88 
anphet 
0.53  0.74  0.66  1.17  0.10  0.63 
convnphet 
0.34  0.61  0.59  1.01  2.24  0.40 
elbo
Model  eq 
matern52 
noisymixture 
weaklyperiodic 
sawtooth 
mixture 

nphet 
0.34  0.66  0.66  1.21  0.12  0.71 
anphet 
0.71  0.88  0.80  1.27  0.00  0.86 
convnphet 
0.61  0.59  2.19  1.07  2.40  1.27 
loglikiw
Model  eq 
matern52 
noisymixture 
weaklyperiodic 
sawtooth 
mixture 

nphet 
0.28  0.57  0.47  1.20  0.32  0.59 
anphet 
0.45  0.67  0.57  1.19  0.16  0.61 
convnphet 
0.09  0.29  0.34  0.98  2.39  0.31 
Bruinsma, Requeima et al. (2021)
Pretrained Models forInterpolation results for the pretrained models are as follows:
loglik
Model  eqnoisy 
matern52noisy 
noisymixture 
weaklyperiodicnoisy 
sawtoothnoisy 
mixturenoisy 

convcnp 
0.80  0.95  0.95  1.20  0.55  0.93 
corconvcnp 
0.70  0.30  0.96  0.47  0.42  0.10 
anphet 
0.61  0.75  0.73  1.19  0.34  0.69 
convnphet 
0.46  0.67  0.53  1.02  1.20  0.50 
Building Blocks
The package provides various building blocks that can be used to compose
encoders and decoders.
For some building blocks, there is a constructor function available that can be
used to more easily construct the block.
More information about a block can be obtained by using the builtin help
function, e.g. ?LayerNorm
.
Glue
Type  Constructor  Description 

Chain 
Put things in sequence.  
Parallel 
Put things in parallel. 
Basic Blocks
Type  Constructor  Description 

BatchedMLP 
batched_mlp 
Batched MLP. 
BatchedConv 
build_conv 
Batched CNN. 
Splitter 
Split off a given number of channels.  
LayerNorm 
Layer normalisation.  
MeanPooling 
Mean pooling.  
SumPooling 
Sum pooling. 
Advanced Blocks
Type  Constructor  Description 

Attention 
attention 
Attentive mechanism. 
SetConv 
set_conv 
Set convolution. 
SetConvPD 
set_conv 
Set convolution for kernel functions. 
Likelihoods
Type  Constructor  Description 

DeterministicLikelihood 
DeterministicLikelihood output.  
FixedGaussianLikelihood 
Gaussian likelihood with a fixed variance.  
AmortisedGaussianLikelihood 
Gaussian likelihood with a fixed variance that is calculated from splitoff channels.  
HeterogeneousGaussianLikelihood 
Gaussian likelihood with inputdependent variance. 
Coders
Type  Constructor  Description 

Materialise 
Materialise a sample.  
FunctionalCoder 
Code into a function space: make the target inputs a discretisation.  
UniformDiscretisation1D 
Discretise uniformly at a given density of points.  
InputsCoder 
Code with the target inputs.  
MLPCoder 
Rhosumphi coder. 
Data Generators
Models can be trained with data generators. Data generators are callables that take in an integer (number of batches) and give back an iterator that generates fourtuples: context inputs, context outputs, target inputs, and target outputs. All tensors should be of rank three where the first dimension is the data dimension, the second dimension is the feature dimension, and the third dimension is the batch dimension.
Data generators can be constructed with DataGenerator
which takes in an
underlying stochastic process.
Stheno.jl can be used to build
Gaussian processes.
In addition, the package exports the following processes:
Type  Description 

Sawtooth 
Sawtooth process. 
BayesianConvNP 
A Convolutional Neural Process with a prior on the weights. 
Mixture 
Mixture of processes. 
Training and Evaluation
Experimentation functionality is exported by NeuralProcesses.Experiment
.
Running Models
A model can be run forward by calling it with three arguments: context inputs, context outputs, and target inputs. All arguments to models should be tensors of rank three where the first dimension is the data dimension, the second dimension is the feature dimension, and the third dimension is the batch dimension.
convcnp(
# Use a batch size of 16.
randn(Float32, 10, 1, 16), # Random context inputs
randn(Float32, 10, 1, 16), # Random context outputs
randn(Float32, 15, 1, 16) # Random target inputs
)
For convenience, the package also exports the function predict
, which
runs a model from inputs of type Vector
and produces predictive means,
lower and upper credible bounds, and predictive samples.
means, lowers, uppers, samples = predict(
convcnp,
randn(Float32, 10), # Random context inputs
randn(Float32, 10), # Random context outputs
randn(Float32, 10) # Random target inputs
)
Training
To train a model, use train!
, which, amongst other things, requires a
loss function and optimiser.
Loss functions are described below, and
optimisers can be found in Flux.Optimiser
;
for most applications, ADAM(5e4)
probably suffices.
After training, a model can be evaluated with eval_model
.
See train.jl
for an example of train!
.
Losses
The following loss functions are exported:
Function  Description 

loglik 
Biased estimate of the logexpectedlikelihood. Exact for models with a deterministic encoder. 
elbo 
Neural process ELBOstyle loss. 
Examples:
# 1sample logEL loss. This is exact for models with a deterministic encoder.
loss(xs...) = loglik(xs..., num_samples=1)
# 20sample logEL loss. This is probably what you want if you are training
# a model with a stochastic encoder.
loss(xs...) = loglik(xs..., num_samples=20)
# 20sample ELBO loss. This is an alternative to `loglik`.
loss(xs...) = elbo(xs..., num_samples=20)
See train.jl
for more examples.
Saving and Loading
After every epoch, the current model and top five best models are saved.
To file to which the model is written is determined by the keyword bson
of train!
.
After training, the best model can be loaded with best_model(path)
.
Examples
The Conditional Neural Process
Perhaps the simplest member of the NP family is the Conditional Neural Process (CNP). CNPs employ a deterministic MLPbased encoder, and an MLPbased decoder. As a first example, we provide an implementation of a simple CNP in the framework:
# The encoder maps into a finitedimensional vector space, and produces a
# global (deterministic) representation, which is then concatenated to every
# test point. We use a `Parallel` object to achieve this.
encoder = Parallel(
# The `InputsEncoder` simply outputs the target locations. We `Chain` this
# with a `DeterministicLikelihood` to form a complete coder.
Chain(
InputsCoder(),
DeterministicLikelihood()
),
Chain(
# The representation is given by a deepset network, which is
# implemented with the `MLPCoder` object. This object receives two MLPs
# upon construction, a prepooling network and postpooling network,
# and produces a vector representation for each context set in the
# batch.
MLPCoder(
batched_mlp(
dim_in =dim_x + dim_y,
dim_hidden=dim_embedding,
dim_out =dim_embedding,
num_layers=num_encoder_layers
),
batched_mlp(
dim_in =dim_embedding,
dim_hidden=dim_embedding,
dim_out =dim_embedding,
num_layers=num_encoder_layers
)
),
# The resulting representation is also chained with a
# `DeterministicLikelihood` as we are interested in a conditional model.
DeterministicLikelihood()
)
)
# The CNP decoder is also MLP based. It first `materialises` the encoder output
# (concatenates the target inputs and context set representation), and then
# passes these through an MLP that outputs a mean and standard deviation at
# every target location.
decoder = Chain(
# First, concatenate target inputs and context set representation. By
# default, `Materialise` uses concatenation to combine the different
# representations in a `Parallel` object, but alternative designs (e.g.,
# summation or multiplicative flows) could also be considered in
# NeuralProcesses.jl.
Materialise(),
# Pass the resulting representations through an MLPbased decoder. The
# input dimensionality is the dimensionality of the target inputs plus
# the dimensionality of the representation. The output dimension is
# twice the output dimensionality, since we require a mean and standard
# deviation for every location.
batched_mlp(
dim_in =dim_x + dim_embedding,
dim_hidden=dim_embedding,
dim_out =2dim_y,
num_layers=num_decoder_layers
),
# The `HeterogeneousGaussianLikelihood` automatically splits its inputs
# in two along the feature dimension, and treats the first half as the
# mean and second half as the standard deviation of a Gaussian
# distribution.
HeterogeneousGaussianLikelihood()
)
cnp = Model(encoder, decoder)
Then, after training, we can make predictions as follows:
means, lowers, uppers, samples = predict(
cnp,
randn(Float32, 10), # Random context inputs
randn(Float32, 10), # Random context outputs
randn(Float32, 10) # Random target inputs
)
The Neural Process
Neural Processes (NP) extend CNPs by adding
a latent variable to the model.
This enables NPs to capture joint, nonGaussian marginal distributions for
target sets, which in turn allows producing coherent samples.
Extending CNPs to NPs in NeuralProcceses.jl is extremely easy: we simply
replace the DeterministicLikelihood
component of the MLPCoder
with a
HeterogenousGaussian
, and adjust the output dimension of the encoder to
produce both means and variances!
# The only change to the encoder is replacing the `DeterministicLikelihood`
# following the `MLPCoder` with a `HeterogenousGaussian`!
encoder = Parallel(
Chain(
InputsCoder(),
DeterministicLikelihood()
),
Chain(
MLPCoder(
batched_mlp(
dim_in =dim_x + dim_y,
dim_hidden=dim_embedding,
dim_out =dim_embedding,
num_layers=num_encoder_layers
),
batched_mlp(
dim_in =dim_embedding,
dim_hidden=dim_embedding,
# Since `HeterogenousGaussian` splits its inputs along the
# channel dimension, we increase the output dimension of the set
# encoder accordingly.
dim_out =2dim_embedding,
num_layers=num_encoder_layers
)
),
# This is the main change required to switch between a CNP and an NP.
HeterogeneousGaussianLikelihood()
)
)
# We can then reuse the previously defined decoder as is!
np = Model(encoder, decoder)
Note that typical NPs consider both a deterministic and latent representation.
This is easily achieved in NeuralProcesses.jl by adding an additional encoder to
the Parallel
object (with a DeterministicLikelihood
), and increasing the
decoder dim_in
accordingly.
In this repo, the builtin NP model uses this form.
This example does not include a deterministic path to emphasise the ease of
switching between conditional and latent variable models in NeuralProcesses.jl.
The Attentive Neural Processes
Next, we consider a more complicated model, and demonstrate how easy it is to
implement with NeuralProcesses.jl.
Attentive Neural Processes (ANPs)
extend NPs by considering an attentive mechanism for the deterministic
representation.
Attention comes builtin with NeuralProcesses.jl, and so we can deploy it within
a Chain
or Parallel
like other building blocks.
Below is an example implementation of an ANP with a deterministic attentive
representation, and a stochastic (Gaussian) global representation.
# The encoder now aggregates three separate representations:
# (i) the target inputs, like the (C)NP,
# (ii) a deterministic attentive representation, and
# (iii) a stochastic global representation.
encoder = Parallel(
# First, include the `InputsCoder` to represent the target set inputs.
Chain(
InputsCoder(),
DeterministicLikelihood()
),
# NeuralProcesses.jl uses a transformerstyle multihead architecture for
# attention. It first embeds the inputs and outputs into a
# finitedimensional vector space with an MLP, and applies the attention in
# the embedding space. The constructor requires the dimensionalities of the
# inputs and outputs, the desired dimensionality of the embedding, and the
# number of heads to employ (each head will use a
# `div(dim_embedding, num_heads)`dimensional embedding), and the number of
# layers in the embedding MLPs. As ANPs employ attention for the
# deterministic representations, this is chained with a
# `DeterministicLikelihood`.
Chain(
attention(
dim_x =dim_x,
dim_y =dim_y,
dim_embedding =dim_embedding,
num_heads =num_encoder_heads,
num_encoder_layers=num_encoder_layers
),
DeterministicLikelihood()
),
# The latent path uses the same form as for the NP.
Chain(
MLPCoder(
batched_mlp(
dim_in =dim_x + dim_y,
dim_hidden=dim_embedding,
dim_out =dim_embedding,
num_layers=num_encoder_layers
),
batched_mlp(
dim_in =dim_embedding,
dim_hidden=dim_embedding,
dim_out =2dim_embedding,
num_layers=num_encoder_layers
)
),
HeterogeneousGaussianLikelihood()
)
)
# The decoder for the ANP is again MLPbased, and so has the same form as the
# (C)NP decoder. The only required change is to account for the dimensionality
# of the latent representation.
decoder = Chain(
Materialise(),
batched_mlp(
dim_in =dim_x + 2dim_embedding,
dim_hidden=dim_embedding,
dim_out =num_noise_channels,
num_layers=num_decoder_layers
),
noise
)
anp = Model(encoder, decoder)
The Convolutional Conditional Neural Process
As a final example, we consider the
Convolutional Conditional Neural Process
(ConvCNP).
The key difference between the ConvCNP and other NPs in terms of implementation
is that it encodes the data into an infinitedimensional function space, rather
than a finitedimensional vector space.
This is handled in NeuralPeocesses.jl with FunctionalCoder
s, which, in
addition to complete coders, also expect Discretisation
objects on construction.
Below is an implementation of the
Convolutional Conditional Neural Process:
# The encoder maps into a function space, which is what `FunctionalCoder`
# indicates.
encoder = FunctionalCoder(
# We cannot exactly represent a function, so we represent a discretisation
# of the function instead. We use a discretisation of 64 points per unit.
# The discretisation will span from the minimal context or target input
# to the maximal context or target input with a margin of 1 on either side.
UniformDiscretisation1D(64f0, 1f0),
Chain(
# The encoder is given by a socalled set convolution, which directly
# maps the data set to the discretised functional representation. The
# data consists of one channel. We also specify a length scale of
# twice the interpoint spacing of the discretisation. The function
# space that we map into is a reproducing kernel Hilbert space (RKHS),
# and the length scale corresponds to the length scale of the kernel of
# the RKHS. We also append a density channel, which ensures that the
# encoder is injective.
set_conv(1, 2 / 64f0; density=true),
# The encoding will be deterministic. We could also use a stochastic
# encoding.
DeterministicLikelihood()
)
)
decoder = Chain(
# The decoder first transforms the functional representation with a CNN.
build_conv(
4f0, # Receptive field size
10, # Number of layers
64, # Number of channels
points_per_unit =64f0, # Density of the discretisation
dimensionality =1, # This is a 1D model.
num_in_channels =2, # Account for density channel.
num_out_channels=2 # Produce a mean and standard deviation.
),
# Use another set convolution to map back from the space of the encoding
# to the space of the data.
set_conv(2, 2 / 64f0),
# Predict means and variances.
HeterogeneousGaussianLikelihood()
)
convcnp = Model(encoder, decoder)
State of the Package
The package is currently mostly a port from academic code. There are a still a number of important things to do:

Support for 2D data: The package is currently built around 1D tasks. We plan to add support for 2D data, e.g. images. This should not require big changes, but it should be implemented carefully.

Tests: The important components of the package are tested, but test coverage is nowhere near where it should be.

Regression tests: For the package, GPU performance is crucial, so regression tests are necessary.

Documentation: Documentation needs to be improved.
Implementation Details
Automatic Differentiation
Tracker.jl is used to automatically
compute gradients.
A number of custom gradients are implemented in src/util.jl
.
Parameter Handling
The package uses Functors.jl to handle
parameters, like Flux.jl, and adheres
to Flux.jl's principles.
This means that only AbstractArray{<:Number}
s are parameters.
Nothing else will be trained, not even Float32
or Float64
scalars.
To not train an array, wrap it with NeuralProcesses.Fixed(array)
,
and unwrap it with NeuralProcesses.unwrap(fixed)
at runtime.
GPU Acceleration
CUDA support for depthwise separable convolutions (DepthwiseConv
from
Flux.jl) is implemented in src/gpu.jl
.
Loop fusion can cause issues on the GPU, so oftentimes computations are unrolled.