
(74a) MILP Based Value Backups to Solve POMDPs with Large or Continuous Action Spaces

Authors 

Agrawal, R. - Presenter, Georgia Institute of Technology
Lee, J. H. - Presenter, Korea Advanced Institute of Science and Technology (KAIST)
Realff, M. - Presenter, Georgia Institute of Technology


Partially observable Markov decision processes (POMDPs) serve as powerful tools for modeling stochastic systems with partial state information. POMDPs have been applied successfully in a variety of areas, including machine maintenance and inspection, planning and scheduling, robotics, network troubleshooting, target identification, and speech recognition [1]. A POMDP generalizes the Markov decision process (MDP) to the case of incomplete knowledge of the system state at each time. Consequently, a probability distribution over the entire state space is maintained at all times. At any given time, this probability distribution is referred to as the belief state (b), and the optimal value function V(b) over belief states satisfies the Bellman equation (1). Here b' is the future belief state, which depends on the current action and the observation at the next time step; r(s,a) is the stage-wise reward for taking action a in state s; and p(b'|b,a) is the belief-state transition probability.
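
The displayed equation is not reproduced in this abstract text; a standard form consistent with the definitions above, assuming a discount factor γ, is

V(b) = \max_{a} \Big[ \sum_{s} b(s)\, r(s,a) + \gamma \sum_{b'} p(b' \mid b, a)\, V(b') \Big]. \qquad (1)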

Since b is continuous, solving the Bellman equation is not straightforward, owing to the presence of infinitely many belief states. However, an interesting feature of POMDPs is that the associated value function is piece-wise linear and convex (PWLC) [2]. This result has enabled many researchers to devise exact solution methods for POMDPs. Since exact methods are limited to problems with very small state, action, and observation spaces, approximate solution methods [3] have gained popularity. Notable among these are the point-based methods, which consider a fixed or evolving set of prototype belief points instead of the entire belief simplex.
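
In the standard alpha-vector representation used in [2] and [4], the PWLC value function at any iteration is described by a finite set Γ of vectors, and its value at a belief point is

V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s),

so evaluating V at any belief reduces to a maximization over finitely many linear functions of b.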

A particular point-based method, PERSEUS [4], makes favorable use of the PWLC structure of the value function to speed up convergence. However, in the current form of PERSEUS and many other point-based methods, the presence of a very large or continuous action space makes it practically impossible to compute the value backup exactly. In view of the PWLC value function, a mixed integer linear program (MILP) is used to circumvent this difficulty. In an alternative formulation, point update equations around the post-decision belief state are developed, as opposed to the traditionally used (pre-decision) belief state. The latter approach reduces the computational load of the MILP, to an extent that depends on the size of the observation space and that of the optimal value function. For successful formulation of the MILP, requirements on the structure of the reward function and on how the state transition and observation probabilities depend on the action are provided. Two illustrations, one each for pre-decision and post-decision belief states, are presented to analyze the efficacy of the methods.
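
For reference, the sketch below illustrates a standard PERSEUS-style point-based value backup at a single belief point with a finite action set, in the spirit of [4]; the function name, argument layout, and array conventions are illustrative assumptions, not the authors' formulation. The explicit loop over actions is exactly the step that the proposed MILP-based backup replaces when the action space is very large or continuous.

```python
import numpy as np

def point_based_backup(b, alphas, T, O, R, gamma):
    """One PERSEUS-style value backup at belief point b (illustrative sketch).

    b      : (S,)       belief over states
    alphas : (K, S)     current set of alpha-vectors (PWLC value function)
    T      : (A, S, S)  T[a, s, s2] = p(s2 | s, a)
    O      : (A, S, Z)  O[a, s2, o] = p(o | s2, a)
    R      : (A, S)     R[a, s] = stage-wise reward r(s, a)
    gamma  : discount factor
    Returns the new alpha-vector at b and the maximizing action.
    """
    A, _, Z = O.shape
    best_val, best_alpha, best_a = -np.inf, None, None
    for a in range(A):                    # explicit action enumeration; the MILP
        alpha_ab = R[a].astype(float)     # backup replaces this loop for large or
        for o in range(Z):                # continuous action spaces
            # g[i, s] = sum over s2 of p(o | s2, a) p(s2 | s, a) alphas[i, s2]
            g = alphas @ (T[a] * O[a][:, o]).T
            alpha_ab = alpha_ab + gamma * g[np.argmax(g @ b)]
        val = alpha_ab @ b
        if val > best_val:
            best_val, best_alpha, best_a = val, alpha_ab, a
    return best_alpha, best_a
```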

References

[1] Cassandra, A. A survey of POMDP applications. Planning with Partially Observable Markov Decision Processes, AAAI Symposium (1998).

[2] Smallwood, R.D., E.J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research (1973).

[3] Hauskrecht, M. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research 13 (2000) 33-94.

[4] Spaan, M.T.J., N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24 (2005) 195-220.