PrivateMultiplicativeWeights.jl
This package implements MWEM
, a simple and practical algorithm for differentially private data release.
MIT Licensed. See LICENSE.md
.
Installation
Install required packages, then open a Julia prompt and call:
using Pkg
Pkg.add("PrivateMultiplicativeWeights")
Main Features
 Differentially private synthetic data preserving lower order marginals of an input data set
 Optimized inmemory implementation for small number of data attributes
 Scalable heuristic for large number of data attributes
 Easytouse interfaces for custom query sets and data representations
Examples
Histogram approximations
Check out histograms.ipynb
for details on how to
use the algorithm to compute differentially private histogram approximations.
Marginal approximations
The package can also be used to create synthetic data that approximates the lower order marginals of a data set with binary features. For the sake of illustration, we create a random data set with hidden correlations. Columns correspond to data points.
d, n = 20, 1000
data_matrix = rand(0:1, d ,n)
data_matrix[3, :] = data_matrix[1, :] .* data_matrix[2, :]
We can run MWEM to produce synthetic data accurate for 1st, 2nd, 3rd order marginals of the source data.
using PrivateMultiplicativeWeights
mw = mwem(Parities(d, 3), Tabular(data_matrix))
This will convert the data to its explicit histogram representation of size 2^d and may not be useful when d is large. See section on factored histograms for an alternative when the dimension d is large.
Convert histograms to matrices
We can convert synthetic data in histogram representation to a tabular (matrix) representation.
table = Tabular(mw.synthetic, n)
Compute error of approximation
Compute error achieved by MWEM:
maximum_error(mw), mean_squared_error(mw)
Note that these statistics are not differentially private.
Parameters
Parameters can be set flexibly with the MWParameters
constructor:
mw = mwem(Parities(d, 3),
Tabular(data_matrix),
MWParameters(epsilon=1.0,
iterations=10,
repetitions=10,
verbose=false,
noisy_init=false,
init_budget=0.05,
noisy_max_budget=0.5))
Available parameters:
Name  Default  Description 

epsilon 
1.0 
Privacy parameter for the algorithm. Each iteration of MWEM is epsilon differentially private. Total privacy guarantees follow via composition theorems. 
iterations 
10 
Number of iterations of MWEM. Each iteration corresponds to selecting one query via the exponential mechanism, evaluating the query on the data, and updating the internal state. 
repetitions 
10 
Number of times MWEM cycles through previously measured queries per iteration. This has no additional privacy cost. 
noisy_init 
false 
This requires part of the epsilon privacy cost. When noisy_init is set to false, the initialization is uniform. 
init_budget 
0.05 
In case the noisy_init flag is set to true, this flag decide what fraction of the epsilon privacy cost will be given for the noisy initialization. When noisy_init is set to false, all the budget will be used by the iterations. 
noisy_max_budget 
0.5 
Decise what fraction from the epsion privacy badget of every iteration will go to the "noisy max" step. (the rest is for the Exponential Mechanism) 
verbose 
false 
print timing and error statistics per iteration (information is not differentially private) 
The function MWParameters
accepts any subset of parameters, e.g.,
MWParameter(epsilon=0.5, iterations=5)
.
Data representations
Histogram representation
By default, MWEM works with the histogram representation of a data sets. This
means that the data is represented by a vector whose length is equal to the size
of domain. For example, data consisting of d
binary attributes would be
converted to an array of length 2^d
. MWEM needs to store and array of this
length in main memory, which is often the computational bottleneck.
Factored histograms
When the histogram representation is too large, try using factored histograms. Factored histograms maintain a product distribution over clusters of attributes of the data. Each component is represented using a single histogram. Components are merged as it becomes necessary. This often allows to scale up MWEM by orders of magnitude.
d, n = 100, 1000
data_matrix = rand(0:1, d, n)
data_matrix[3, :] = data_matrix[1, :] .* data_matrix[2, :]
mw = mwem(FactorParities(d, 3), Tabular(data_matrix))
Also see examples.jl
.
Query representations
There are two ways to define custom query sets.
Histogram queries
Histogram queries are linear functions in the histogram representation of the
data. You can define custom query workloads by using
HistogramQueries(query_matrix)
instead of Parities(d, 3)
. Here query matrix
is an N x k
matrix specifying the query set in its Histogram
representation, N
is the histogram length and k
is the k
is the number of
queries.
Custom query types
To build query sets with your own implicit representations, subtype
Query
and Queries
. Implement the functions specified in src/interface.jl
.
See src/parities.jl
for an example.
Available query sets

Parities(d, k)
Parities of
k
out ofd
attributes. This corresponds to approximatingk
way marginals of the original data. 
FactorParities(d, k)
Parities of
k
out ofd
attributes for factored histogram representation. 
SeriesRangeQueries(N)
Range queries corresponding to all interval queries over a histogram of length
N
. SeriesRangeQueries*(Intervals)
Range queries over histogram with length N, corresponding to intervals = {Interval1, Interval2, ...} where Interval = (i, j) so that 1 <= i <= j <= N.
Contributing to this package
There are many ways to contribute to this repository:
 Experiments
 Additional query sets (e.g., twodimensional range queries)
 Additional tests, debugging, optimization
 Additional documentation
Citing this package
The MWEM algorithm was presented in the following paper:
@inproceedings{HLM12,
author = "Moritz Hardt and Katrina Ligett and Frank McSherry",
title = "A simple and practical algorithm for differentiallyprivate data release",
booktitle = {Proc.\ $26$th Neural Information Processing Systems (NIPS)},
year = {2012},
}