The MDP package implements algorithms for Markov decision processes (MDPs).
MDP can be installed through the Julia package manager:

```julia
julia> using Pkg; Pkg.add(url="https://github.com/adityam/MDP.jl")
```
The MDP package currently implements the following algorithms:

- Infinite horizon discounted setup
    - Value iteration
- Finite horizon setup
    - Standard dynamic program
Other algorithms related to MDPs will be implemented as time permits. Pull requests welcome.
There are two ways to specify the MDP model:

- By specifying the controlled transition matrices and the cost or reward function.
- By specifying the Bellman update operator.
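For the second approach, the following minimal sketch shows what a Bellman update operator computes for a maximization objective. This is generic Julia written for illustration; it is not MDP.jl's API, and the function name `bellman_update` is made up for this sketch:

```julia
# One Bellman update for a maximization objective:
#   (T v)(s) = max_a [ r(s, a) + β Σ_t P(t | s, a) v(t) ]
# Here r is an S×A reward matrix, P is a list of S×S transition matrices
# (one per action), β is the discount factor, and v is the current value function.
function bellman_update(v, r, P, β)
    S, A = size(r)
    Q = [r[s, a] + β * sum(P[a][s, t] * v[t] for t in 1:S) for s in 1:S, a in 1:A]
    v_new = vec(maximum(Q, dims=2))          # updated value of each state
    g     = [argmax(Q[s, :]) for s in 1:S]   # greedy action at each state
    return v_new, g
end
```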
Perhaps the easiest way to understand these is by means of examples.
This example is taken from Puterman, "Markov Decision Processes", Wiley 2005 (Chapter 3.1).
Consider a Markov decision process where:

- States `S = {1, 2}`
- Actions `A = {1, 2}`
- Rewards `r(s, a)` given by

  ```
  r(1,1) =  5      r(1,2) = 10
  r(2,1) = -1      r(2,2) = -Inf
  ```

  Note that `r(2,2) = -Inf` means that action 2 is infeasible at state 2.

- Transition probabilities `P(s_next | s_current, action)` given by

  ```
  P(1 | 1, 1) = 0.5    P(2 | 1, 1) = 0.5
  P(1 | 2, 1) = 0.0    P(2 | 2, 1) = 1.0
  P(1 | 1, 2) = 0.0    P(2 | 1, 2) = 1.0
  P(1 | 2, 2) = 0.5    P(2 | 2, 2) = 0.5
  ```
This model is specified as follows:

- Specify the rewards as an `S x A` matrix:

  ```julia
  r = [  5   10
        -1 -Inf ]
  ```

- Specify the transition matrices as a list `{ P(. | ., 1), P(. | ., 2) }`:

  ```julia
  P = Matrix[ [ 0.5 0.5
                0.0 1.0 ],
              [ 0.0 1.0
                0.5 0.5 ] ]
  ```

  This list may also be specified using sparse matrices (which can lead to a considerable speedup when the transition matrices are sparse):

  ```julia
  using SparseArrays
  P = SparseMatrixCSC[ sparse([0.5 0.5; 0.0 1.0]),
                       sparse([0.0 1.0; 0.5 0.5]) ]
  ```

- Specify an MDP model:

  ```julia
  model = ProbModel(r, P; objective=:Max)
  ```

  Here `objective=:Max` specifies that we want to maximize the reward. To minimize a cost function, use `objective=:Min` instead.
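Since each `P(. | ., a)` must be a stochastic matrix, a quick sanity check can catch typos in the transition probabilities before the model is built. This is ordinary Julia, not part of MDP.jl:

```julia
# Every transition matrix must be row-stochastic: each row sums to one.
@assert all(Pa -> all(abs.(sum(Pa, dims=2) .- 1) .< 1e-8), P)
```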
To solve this model using value iteration, use

```julia
(v, g) = valueIteration(model; discount=0.95)
```

where `v` is the value function (stored as an `S`-dimensional vector) and `g` is the optimal policy (again, stored as an `S`-dimensional vector).
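Conceptually, value iteration just applies the Bellman update sketched earlier until successive iterates are close. A bare-bones version is shown below; again, this is plain Julia for illustration, not MDP.jl's implementation (which also uses the span-seminorm acceleration described under `tolerance` below):

```julia
# Repeatedly apply bellman_update (defined above) until the sup-norm
# of the change drops below tol, or maxiter iterations are reached.
function naive_value_iteration(r, P, β; tol=1e-4, maxiter=1_000)
    S = size(r, 1)
    v = zeros(S)
    g = ones(Int, S)
    for _ in 1:maxiter
        v_new, g = bellman_update(v, r, P, β)
        converged = maximum(abs.(v_new - v)) < tol
        v = v_new
        converged && break
    end
    return (v, g)
end
```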
The optional arguments for `valueIteration` are:

- `discount` (default `0.95`): The discount factor.
- `iterations` (default `1_000`): The maximum number of iterations to be performed.
- `tolerance` (default `1e-4`): Stop the iteration when the sup-norm of `v - v_opt` is less than `tolerance`, where `v_opt` is the optimal solution. To speed up convergence, the iteration uses the span seminorm, `span(v) = max(v) - min(v)`, and applies a correction before returning the value function. See Puterman, Sec. 6.6, for details.
- `initial_v` (default: an all-zeros vector): The initial guess for the value function.
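For reference, here is a call that sets all of the optional arguments explicitly; the values shown are just the documented defaults (and the keyword names are as listed above):

```julia
(v, g) = valueIteration(model;
                        discount   = 0.95,
                        iterations = 1_000,
                        tolerance  = 1e-4,
                        initial_v  = zeros(2))
```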
Here is the full example:

```julia
using MDP

P = Matrix[ [ 0.5 0.5
              0.0 1.0 ],
            [ 0.0 1.0
              0.5 0.5 ] ]

r = [  5   10
      -1 -Inf ]

model = ProbModel(r, P; objective=:Max)
(v, g) = valueIteration(model; discount=0.95)

println(g)
println(v)
```
which outputs:

```
[ Info: value iteration will converge in at most 20 iterations
[ Info: Reached precision 5.742009e-05 at iteration 18
[1, 1]
[-8.5714, -20.0]
```
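The printed policy `g = [1, 1]` says to take action 1 in both states. The value function can be verified by hand: for a fixed policy `g`, `v` is the unique solution of the linear system `v = r_g + β P_g v`. A short, package-independent check:

```julia
using LinearAlgebra

β  = 0.95
Pg = [0.5 0.5; 0.0 1.0]    # transition matrix under the policy g = [1, 1]
rg = [5.0, -1.0]           # per-stage rewards r(1, 1) and r(2, 1)
v  = (I - β * Pg) \ rg
println(v)                 # ≈ [-8.5714, -20.0], matching the output above
```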