14,006 Hits in 3.0 sec

MDPs with Non-Deterministic Policies

Mahdi Milani Fard, Joelle Pineau
2009 Advances in Neural Information Processing Systems  
In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies.  ...  Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs.  ...  A non-augmentable ε-optimal non-deterministic policy Π on an MDP M is a policy that is not augmentable according to the constraint in Eqn 6.  ... 
pmid:21625292 pmcid:PMC3103230 fatcat:loud5sq3zrh4hkqgucsrjvvkze
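
This entry (and its 2011 JAIR version below) treats a non-deterministic policy as a set of allowed actions per state, judged against the optimal value function. A minimal sketch of that idea, assuming as an illustration (not the paper's exact Eqn 6) that ε-optimality requires the worst-case value over the allowed actions to stay above (1 - ε) times the optimal value:

```python
import numpy as np

def value_iteration(P, R, gamma, iters=1000):
    """Optimal values of a finite MDP. P[a][s, s2] = transition prob., R[s, a] = reward."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(n_actions)]).T
        V = Q.max(axis=1)
    return V

def worst_case_value(P, R, gamma, allowed, iters=1000):
    """Value of a non-deterministic policy when an adversary picks, in each
    state, the worst action from the allowed set allowed[s]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(n_actions)]).T
        V = np.array([Q[s, list(allowed[s])].min() for s in range(n_states)])
    return V

def is_eps_optimal(P, R, gamma, allowed, eps):
    """Illustrative check: every state keeps at least (1 - eps) of V*."""
    return np.all(worst_case_value(P, R, gamma, allowed)
                  >= (1.0 - eps) * value_iteration(P, R, gamma))
```

Under this reading, non-augmentability means that adding any further action to any allowed[s] makes the check fail, matching the non-augmentable ε-optimal notion quoted above.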

Stationary Deterministic Policies for Constrained MDPs with Multiple Rewards, Costs, and Discount Factors

Dmitri A. Dolgov, Edmund H. Durfee
2005 International Joint Conference on Artificial Intelligence  
We consider the problem of policy optimization for a resource-limited agent with multiple time-dependent objectives, represented as an MDP with multiple discount factors in the objective function and constraints  ...  We show that limiting search to stationary deterministic policies, coupled with a novel problem reduction to mixed integer programming, yields an algorithm for finding such policies that is computationally  ...  deterministic policies provides the first practical approach to dealing with constrained MDPs with multiple discount factors.  ... 
dblp:conf/ijcai/DolgovD05 fatcat:gxtw5mmmg5e4bixvazvn2hffqm
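
The abstract above mentions a reduction to mixed integer programming for finding stationary deterministic policies in constrained MDPs. A minimal sketch of the standard occupancy-measure formulation with binary action-choice variables (a single discount factor and a single cost constraint for brevity, so this is not the paper's multi-discount construction), using PuLP as an illustrative solver interface:

```python
import pulp

def constrained_mdp_milp(P, R, C, budget, gamma, alpha):
    """Maximize discounted reward subject to a discounted cost budget,
    restricted to stationary deterministic policies.
    P[a][s][s2]: transitions, R[s][a]: rewards, C[s][a]: costs,
    alpha[s]: initial-state distribution."""
    S, A = range(len(alpha)), range(len(P))
    big_m = 1.0 / (1.0 - gamma)  # total occupancy bound when alpha sums to 1
    prob = pulp.LpProblem("constrained_mdp", pulp.LpMaximize)
    x = {(s, a): pulp.LpVariable(f"x_{s}_{a}", lowBound=0) for s in S for a in A}
    d = {(s, a): pulp.LpVariable(f"d_{s}_{a}", cat="Binary") for s in S for a in A}

    # Objective: expected discounted reward under the occupancy measure x.
    prob += pulp.lpSum(R[s][a] * x[s, a] for s in S for a in A)

    # Bellman-flow (occupancy) constraints.
    for s2 in S:
        prob += (pulp.lpSum(x[s2, a] for a in A)
                 - gamma * pulp.lpSum(P[a][s][s2] * x[s, a] for s in S for a in A)
                 == alpha[s2])

    # Discounted cost budget.
    prob += pulp.lpSum(C[s][a] * x[s, a] for s in S for a in A) <= budget

    # Determinism: only one action may carry occupancy in each state.
    for s in S:
        prob += pulp.lpSum(d[s, a] for a in A) == 1
        for a in A:
            prob += x[s, a] <= big_m * d[s, a]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {s: max(A, key=lambda a: x[s, a].value()) for s in S}
```

The linking constraints x[s, a] <= big_m * d[s, a] with sum_a d[s, a] = 1 force all occupancy in a state onto a single action, which is what makes the extracted policy deterministic.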

Non-Deterministic Policies in Markovian Decision Processes

M. Milani Fard, J. Pineau
2011 The Journal of Artificial Intelligence Research  
In an experiment with human subjects, we show that humans assisted by hints based on non-deterministic policies outperform both human-only and computer-only agents in a web navigation task.  ...  We provide two algorithms to compute non-deterministic policies in discrete domains. We study the output and running time of these methods on a set of synthetic and real-world problems.  ...  A non-augmentable ε-optimal non-deterministic policy Π on an MDP M is a policy that is non-augmentable according to the constraint in Eqn 23.  ... 
doi:10.1613/jair.3175 fatcat:arqm4qzs5vcnvbpjgsf53bzrnu

Markov Decision Petri Nets with Uncertainty [chapter]

Marco Beccuti, Elvio G. Amparore, Susanna Donatelli, Dimitri Scheftelowitsch, Peter Buchholz, Giuliana Franceschinis
2015 Lecture Notes in Computer Science  
Markov Decision Processes (MDPs) are a well known mathematical formalism that combines probabilities with decisions and allows one to compute optimal sequences of decisions, denoted as policies, for fairly  ...  However, the practical application of MDPs is often faced with two problems: the specification of large models in an efficient and understandable way, which has to be combined with algorithms to generate  ...  Every path starting with a non-deterministic state followed by a "composite action" σ and by a (maximal) sequence of probabilistic states ending with a non-deterministic state, is substituted in the BMDP  ... 
doi:10.1007/978-3-319-23267-6_12 fatcat:vwjyvq2fkzgmzcinv3rljnmkim

Saturated Path-Constrained MDP: Planning under Uncertainty and Deterministic Model-Checking Constraints

Jonathan Sprauel, Andrey Kolobov, Florent Teichteil-Königsbuch
2014 Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence  
This paper presents Saturated Path-Constrained Markov Decision Processes (SPC MDPs), a new MDP type for planning under uncertainty with deterministic model-checking constraints, e.g., "state s must be  ...  We present a mathematical analysis of SPC MDPs, showing that although SPC MDPs generally have no optimal policies, every instance of this class has an epsilon-optimal randomized policy for any ε > 0.  ...  Although PC MDPs are more general than SPC MDPs (they allow probabilistic, or non-saturated, constraints of the kind "the solution policy must visit state s before s' at least with probability p"), their  ... 
doi:10.1609/aaai.v28i1.9041 fatcat:74vbkkvg55c3tel5s7ku4wbvjm
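
The constraints in this entry are of the form "visit state s before state s'". One common way to handle such a requirement (a generic construction, not necessarily the paper's) is to fold the constraint's memory into the state as a flag and send violating transitions to a dead end; a minimal sketch:

```python
def product_with_precedence(trans, s_first, s_second):
    """Augment an MDP so that reaching s_second before s_first is a dead end.
    trans[(s, a)] = list of (prob, next_state). Product states are (s, seen),
    where seen records whether s_first has already been visited."""
    prod_trans = {}
    for (s, a), outcomes in trans.items():
        for seen in (False, True):
            new_outcomes = []
            for p, s2 in outcomes:
                seen2 = seen or (s2 == s_first)
                if s2 == s_second and not seen2:
                    s2, seen2 = "violation", seen   # absorbing dead-end state
                new_outcomes.append((p, (s2, seen2)))
            prod_trans[((s, seen), a)] = new_outcomes
    return prod_trans
```

Planning on the augmented model then amounts to avoiding the hypothetical "violation" dead end, with the initial product state starting from (s0, s0 == s_first).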

The 10,000 Facets of MDP Model Checking [chapter]

Christel Baier, Holger Hermanns, Joost-Pieter Katoen
2019 Lecture Notes in Computer Science  
We focus on Markov decision processes (MDPs, for short).  ...  We survey the basic ingredients of MDP model checking and discuss its enormous developments since the seminal works by Courcoubetis and Yannakakis in the early 1990s.  ...  Randomized positional policies select μ ∈ D(s_i) with a certain probability. Deterministic policies select a fixed distribution from D(s_i).  ... 
doi:10.1007/978-3-319-91908-9_21 fatcat:yjsuwb5ibjff3cq3niatu6sbxq
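
The last two snippet sentences contrast deterministic positional policies (always pick one distribution from D(s_i)) with randomized positional ones (mix over D(s_i)). A tiny illustration of the two object kinds, with hypothetical names:

```python
import random

# D[s] is the list of probability distributions enabled in state s
# (each distribution maps successor states to probabilities).

def deterministic_positional(D, choice):
    """choice[s] is an index into D[s]; the scheduler always picks it."""
    return lambda s: D[s][choice[s]]

def randomized_positional(D, weights):
    """weights[s] is a probability vector over the indices of D[s]."""
    return lambda s: random.choices(D[s], weights=weights[s], k=1)[0]
```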

LCRL: Certified Policy Synthesis via Logically-Constrained Reinforcement Learning [article]

Hosein Hasanbeig and Daniel Kroening and Alessandro Abate
2022 arXiv   pre-print
with maximal probability.  ...  LCRL is a software tool that implements model-free Reinforcement Learning (RL) algorithms over unknown Markov Decision Processes (MDPs), synthesising policies that satisfy a given linear temporal specification  ...  Namely, when there exists a non-deterministic transition in an LDBA state, the MDP action space is augmented with the non-deterministic transition predicate of the LDBA.  ... 
arXiv:2209.10341v1 fatcat:e4vhttrwsvcqdpbfexdlo2w2im
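
The final snippet sentence describes augmenting the MDP action space to resolve non-deterministic LDBA transitions. A minimal sketch of such a product construction, assuming as an illustration of the general idea (not LCRL's actual data structures) that LDBA ε-moves are exposed as extra actions that change only the automaton component:

```python
def product_mdp_step(mdp_step, ldba_delta, ldba_eps_moves, label):
    """Return a step function over product states (s, q).

    mdp_step(s, a)      -> list of (prob, s2)
    ldba_delta(q, L)    -> next automaton state on reading label set L
    ldba_eps_moves(q)   -> automaton states reachable from q by epsilon moves
    label(s)            -> label set of MDP state s
    Actions are either ('mdp', a) or ('eps', q2)."""
    def step(state, action):
        s, q = state
        kind, payload = action
        if kind == "eps":                       # automaton-only move
            return [(1.0, (s, payload))] if payload in ldba_eps_moves(q) else []
        outcomes = mdp_step(s, payload)         # ordinary MDP action
        return [(p, (s2, ldba_delta(q, label(s2)))) for p, s2 in outcomes]
    return step
```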

Sample Efficient Model-free Reinforcement Learning from LTL Specifications with Optimality Guarantees [article]

Daqian Shao, Marta Kwiatkowska
2023 arXiv   pre-print
policy that maximizes the probability of satisfying a given LTL specification with optimality guarantees.  ...  Linear Temporal Logic (LTL) is widely used to specify high-level objectives for system policies, and it is highly desirable for autonomous systems to learn the optimal policy with respect to such specifications  ...  The common choices of automata include deterministic Rabin automata and non-deterministic Büchi automata.  ... 
arXiv:2305.01381v2 fatcat:navp6kp3pfdkpc7mssqlrybn2a

Learning Without State-Estimation in Partially Observable Markovian Decision Processes [chapter]

Satinder P. Singh, Tommi Jaakkola, Michael I. Jordan
1994 Machine Learning Proceedings 1994  
In this paper we consider only partially observable MDPs (POMDPs), a useful class of non-Markovian decision processes.  ...  ., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs).  ...  Therefore, in MDPs the agent can restrict its search to the finite set of stationary deterministic policies.  ... 
doi:10.1016/b978-1-55860-335-6.50042-8 dblp:conf/icml/SinghJJ94 fatcat:22vp5lktdzbbxbe7vslfeb35qi

Extending Classical Planning Heuristics to Probabilistic Planning with Dead-Ends

Florent Teichteil-Königsbuch, Vincent Vidal, Guillaume Infantes
2011 Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence  
the winners of previous competitions (FF-Replan, FPG, RFF), and show that our discounted heuristics solve more problems than non-discounted ones, with better criteria values.  ...  Previous attempts like mGPT applied classical planning heuristics to an all-outcome determinization of MDPs without a discount factor; yet, discounted optimization is required to solve problems with potential  ...  A problem is considered solved by a planner if its policy reaches a goal state with non-zero probability. Candidate MDP heuristic algorithms.  ... 
doi:10.1609/aaai.v25i1.8016 fatcat:arvs5sclfzfkvb6lcmuzo3nv6m
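
The snippet refers to an all-outcome determinization of an MDP: every probabilistic outcome of an action is promoted to its own deterministic action so that classical planning heuristics can be evaluated on the result. A minimal sketch:

```python
def all_outcome_determinization(trans):
    """trans[(s, a)] = list of (prob, s2). Each outcome becomes its own
    deterministic action (a, i) -> s2, dropping the probabilities."""
    det = {}
    for (s, a), outcomes in trans.items():
        for i, (_prob, s2) in enumerate(outcomes):
            det[(s, (a, i))] = s2
    return det
```

The entry's point is that running such heuristics without a discount factor is problematic on domains with dead ends, hence the discounted variants it proposes.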

Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees

M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, I. Lee
2019 2019 IEEE 58th Conference on Decision and Control (CDC)  
We first translate the LTL specification into a Limit Deterministic Büchi Automaton (LDBA), which is then used in an on-the-fly product with the PL-MDP.  ...  Uncertainty is considered in the workspace properties, the structure of the workspace, and the agent actions, giving rise to a Probabilistically-Labeled Markov Decision Process (PL-MDP) with unknown graph  ...  The following theorem shows that the optimal policy produced by Algorithm 1 satisfies the given LTL property with non-zero probability.  ... 
doi:10.1109/cdc40024.2019.9028919 dblp:conf/cdc/HasanbeigKAKPL19 fatcat:w4yfiw5vkrdnnhmiqp6aopc2ya

Strong Polynomiality of the Value Iteration Algorithm for Computing Nearly Optimal Policies for Discounted Dynamic Programming [article]

Eugene A. Feinberg, Gaojin He
2020 arXiv   pre-print
This note provides upper bounds on the number of operations required to compute, by value iterations, a nearly optimal policy for an infinite-horizon discounted Markov decision process with a finite number  ...  reward function, and desired closeness to optimality, these upper bounds are strongly polynomial in the number of state-action pairs, and one of the provided upper bounds has the property that it is a non-decreasing  ...  As discussed above, γ = 1 for this MDP with deterministic transitions.  ... 
arXiv:2001.10174v1 fatcat:gss3vsncgvd3tcx2nu4xj2hjqa
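
The note above counts value-iteration operations. A classical guarantee (weaker than the note's strongly polynomial bounds) is that stopping when successive iterates differ by less than ε(1 - γ)/(2γ) in sup norm makes the greedy policy ε-optimal; a minimal sketch of value iteration with that stopping rule:

```python
import numpy as np

def vi_eps_optimal_policy(P, R, gamma, eps):
    """Value iteration with the classical stopping rule
    ||V_{k+1} - V_k|| < eps * (1 - gamma) / (2 * gamma),
    after which the greedy policy is eps-optimal.
    P[a][s, s2]: transitions, R[s, a]: rewards, 0 < gamma < 1."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    threshold = eps * (1 - gamma) / (2 * gamma)
    while True:
        Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(n_actions)]).T
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < threshold:
            return Q.argmax(axis=1), V_new   # greedy policy and its value bound
        V = V_new
```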

Heuristics, Answer Set Programming and Markov Decision Process for Solving a Set of Spatial Puzzles [article]

Thiago Freitas dos Santos, Paulo E. Santos, Leonardo A. Ferreira, Reinaldo A. C. Bianchi, Pedro Cabalar
2019 arXiv   pre-print
Experiments were performed on deterministic, non-deterministic and non-stationary versions of the puzzles.  ...  ASP is applied to represent the domain as an MDP, while a Reinforcement Learning algorithm (Q-Learning) is used to find the optimal policies.  ...  (c) t-test comparing oASP(MDP) with Heuristic against the traditional oASP(MDP). Fig. 6: Number of Steps and Return results for the Non-Deterministic Fisherman's Folly puzzle.  ... 
arXiv:1903.03411v1 fatcat:h7uudj43zzavzocg2kshefcuma
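
The snippet pairs an ASP encoding of the domain with Q-Learning. The tabular Q-learning update it relies on is standard; a minimal sketch, with a hypothetical environment interface env.reset() / env.step(a) standing in for the puzzle simulators:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                 # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)                     # hypothetical interface
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```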

Synthesis of Discounted-Reward Optimal Policies for Markov Decision Processes Under Linear Temporal Logic Specifications [article]

Krishna C. Kalagarla, Rahul Jain, Pierluigi Nuzzo
2021 arXiv   pre-print
We present a method to find an optimal policy with respect to a reward function for a discounted Markov decision process under general linear temporal logic (LTL) specifications.  ...  These occupancy measures are then connected to a single policy via a novel reduction resulting in a mixed integer linear program whose solution provides an optimal policy.  ...  s ∈ S^× visited with non-zero probability, policy π^× is deterministic and, further, y_{sa} > 0 if and only if Δ_{sa} = 1.  ... 
arXiv:2011.00632v2 fatcat:d6nzjaccrvgu3fkmgttaxlefo4

A Theoretical Connection Between Statistical Physics and Reinforcement Learning [article]

Jad Rahme, Ryan P. Adams
2021 arXiv   pre-print
Moreover, when the MDP dynamics are deterministic, the Bellman equation for 𝒵 is linear, allowing direct solutions that are unavailable for the nonlinear equations associated with traditional value functions  ...  The policies learned via these 𝒵-based Bellman updates are tightly linked to Boltzmann-like policy parameterizations.  ...  By working with partition functions we transformed a nonlinear problem into a linear one. This remarkable result is reminiscent of linearly solvable MDPs (Todorov, 2007).  ... 
arXiv:1906.10228v2 fatcat:fkt4i2kjhrarxipjm4up3vriea
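
The first snippet sentence says the Bellman equation for the partition function 𝒵 becomes linear under deterministic dynamics. One way to see this, under the simplifying assumptions of undiscounted maximum-entropy (soft) values and deterministic transitions: writing Z(s) = exp(V(s)/T), the soft recursion V(s) = T log Σ_a exp((r(s, a) + V(next(s, a)))/T) exponentiates to the linear equation Z(s) = Σ_a exp(r(s, a)/T) Z(next(s, a)), which can be solved as a linear system. A minimal sketch (the well-posedness assumption, e.g., terminal states reachable, is mine, not the paper's):

```python
import numpy as np

def partition_values(states, actions, next_state, reward, terminal, T=1.0):
    """Solve Z(s) = sum_a exp(r(s,a)/T) * Z(next(s,a)), with Z = 1 at terminal
    states; then V(s) = T * log Z(s). actions(s) lists the actions in s."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    M, b = np.zeros((n, n)), np.zeros(n)
    for s in states:
        i = idx[s]
        M[i, i] = 1.0
        if s in terminal:
            b[i] = 1.0          # boundary condition Z(terminal) = 1
            continue
        for a in actions(s):    # row encodes Z(s) - sum_a exp(r/T) Z(next) = 0
            M[i, idx[next_state(s, a)]] -= np.exp(reward(s, a) / T)
    Z = np.linalg.solve(M, b)
    return {s: T * np.log(Z[idx[s]]) for s in states}
```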
Showing results 1-15 of 14,006