Optimal Control and Reinforcement Learning for Stochastic Systems under Temporal Logic Specifications

This thesis explores methods for near-optimal stochastic planning under high-level formal specifications. Such specifications describe the properties that system behaviors must satisfy, while optimal stochastic planning seeks to maximize system performance with respect to given criteria. Combining the two, the goal is to synthesize a control policy that maximizes the probability that the system's behavior satisfies the specification. Such problems arise frequently in defense operations, robotics, and other cyber-physical systems. However, algorithms that require full knowledge of the system model, or that scale poorly, cannot achieve optimal performance when the model is unavailable or intractably large. This thesis presents approximate algorithms that achieve near-optimal performance. The main contributions are as follows:

1) We translate system specifications given as probabilistic computation tree logic (PCTL) formulas into hard constraints and present a probabilistically complete approximate dynamic programming algorithm for near-optimal planning with multiple objectives.
2) We present a comprehensive optimal planning framework that leverages topological information to accelerate policy learning.
3) To handle continuous-state dynamic systems, we develop a variant of the actor-critic algorithm inspired by the proposed approximate dynamic programming algorithm.
4) We extend the framework to a two-player setting in which the agent exploits asymmetric information between the players to synthesize a policy that outmaneuvers its opponent.

The thesis first develops an approximate dynamic programming method for stochastic systems with soft performance criteria and hard constraints specified in a class of PCTL formulas. We model the stochastic system as a Markov decision process (MDP). The approach consists of two steps. First, with suitably defined cost functions, we translate a class of PCTL formulas into chance constraints that are enforced during planning. Second, we devise a probabilistically complete sampling-based method that integrates randomized optimization with dynamic programming under the softmax Bellman operator. The optimization iteratively solves for an upper bound on the value function while satisfying the translated constraints, and on-policy sampling yields a tight error bound between this upper bound and the true value function.

When the MDP carries a high-level specification expressed as a linear temporal logic (LTL) formula, probabilistic optimal planning remains challenging because rewards are sparse. The thesis therefore presents a policy learning framework guided by the topological information encoded in the specification. Following the standard procedure, we translate the specification into a deterministic finite-state automaton (DFA) and take the product of the MDP and the DFA to construct a product system. Our algorithm computes a topological order that captures the structure of the DFA; this order yields a framework for updating values with optimality guarantees and accelerates policy learning. Within this framework, the thesis applies the probabilistically complete approximate dynamic programming algorithm to efficiently learn a near-optimal value function and its associated policy.
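As a rough illustration of the dynamic-programming step above, the sketch below runs value iteration with a softmax (log-sum-exp) Bellman backup on a tiny tabular MDP. The three-state model, rewards, discount factor, and temperature tau are made-up placeholders, and the log-sum-exp form is only one common variant of the softmax operator; this is not the thesis's actual model, constraint translation, or sampling scheme.

```python
import numpy as np

# P[s][a]: list of (next_state, probability); R[s][a]: immediate reward.
# All numbers here are illustrative placeholders.
P = {
    0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
    1: {0: [(2, 1.0)],           1: [(0, 0.5), (2, 0.5)]},
    2: {0: [(2, 1.0)],           1: [(2, 1.0)]},
}
R = {0: {0: 0.0, 1: 0.1}, 1: {0: 1.0, 1: 0.2}, 2: {0: 0.0, 1: 0.0}}
gamma, tau = 0.95, 0.1  # discount factor and softmax temperature

V = np.zeros(len(P))
for _ in range(1000):
    V_new = np.empty_like(V)
    for s in P:
        # One-step lookahead Q-values for every action available in state s.
        q = np.array([R[s][a] + gamma * sum(p * V[sp] for sp, p in P[s][a])
                      for a in P[s]])
        # Softmax (log-sum-exp) backup, computed in a numerically stable way;
        # it upper-bounds the greedy max backup and recovers it as tau -> 0.
        m = q.max()
        V_new[s] = m + tau * np.log(np.sum(np.exp((q - m) / tau)))
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

print("softmax value function:", V)
```

In the thesis's setting, the backup would additionally respect the chance constraints obtained from the PCTL translation and would be driven by on-policy samples rather than a full sweep over a known model.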
The thesis next investigates formal policy synthesis for continuous-state stochastic systems given high-level specifications in LTL. Because the state space is continuous, the product system has a hybrid product state space, which worsens the reward-sparsity issue. We present an actor-critic reinforcement learning algorithm to which the topological order still applies; the algorithm enjoys hyperparameter self-tuning, and we prove its optimality and convergence. Neural networks approximate the value and policy functions as an alternative to storing intractable hybrid-state system models. When the DFA is constructed, its states are indexed by integers, and feeding these indices into a single network would impose an artificial ordinal ranking on the approximated value or policy functions. We therefore use modular learning, assigning an individual neural network to each automaton state's value function (and policy), to break this ordinal relationship.

The preceding settings, in which a dynamic system interacts with a stochastic environment, involve symmetric information; this thesis also looks beyond that case. We develop optimal probabilistic planning of deception using a concurrent stochastic game with high-level formal specifications. There are two players, an agent (player 1) and an adversary (player 2), and the adversary holds incomplete knowledge of the agent's task specification. During the interaction, the adversary infers the agent's intention from observations and acts to prevent the agent from accomplishing its task; the agent, in turn, exploits the adversary's incomplete information to outmaneuver it. The thesis introduces a class of hypergame models that capture this dynamic interaction under asymmetric, incomplete information, establishes a solution concept for this class of hypergames, and designs an online detection mechanism that alerts the agent to potential errors in its model of the adversary's behavior. The thesis concludes with a summary and directions for future research.
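A minimal sketch of the modular-learning idea above: rather than feeding the integer index of the automaton state into one network, which would impose an artificial ordinal ranking, each automaton state of the product system gets its own small value network. The observation dimension, hidden width, automaton-state names, and the use of PyTorch are illustrative assumptions, not details from the thesis.

```python
import torch
import torch.nn as nn

class ModularValueFunction(nn.Module):
    """One independent value network per automaton state (hypothetical sketch)."""

    def __init__(self, automaton_states, obs_dim, hidden=64):
        super().__init__()
        # ModuleDict keys must be strings; each head is trained independently.
        self.heads = nn.ModuleDict({
            q: nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, 1))
            for q in automaton_states
        })

    def forward(self, obs, q):
        # Evaluate only the network attached to the current automaton state,
        # so no ordinal relationship between automaton states is learned.
        return self.heads[q](obs)

# Illustrative usage: a 2-dimensional continuous system state and three DFA states.
critic = ModularValueFunction(["q0", "q1", "q2"], obs_dim=2)
obs = torch.randn(8, 2)        # a batch of continuous system states
values = critic(obs, "q1")     # values under automaton state "q1"
print(values.shape)            # torch.Size([8, 1])
```

The same pattern can be applied to the actor's policy networks, with an actor-critic loop updating only the head that matches the current automaton state of the product system.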

Identifier
  • etd-82441
Year
  • 2022
Date created
  • 2022-12-12
Source
  • etd-82441
Last modified
  • 2023-01-11

Permanent link to this page: https://digital.wpi.edu/show/rj430792v