(674g) End-to-End Control Policy Learning through Implicit Differentiation with World Models

Authors 

Banker, T. - Presenter, The Ohio State University
Mesbah, A., University of California, Berkeley

Understanding energy flows in nanoscale systems is central to biology and nanotechnology. Molecular machines can drive nonequilibrium processes at these scales, but, according to the second law of thermodynamics, externally driven control of a fluctuating molecular machine comes with a thermodynamic cost, which can be meaningfully defined only at the level of individual trajectories [1]. This presents an optimization problem in designing energetically efficient microscopic machines capable of driving a system between initial and final states at minimal cost. Through the use of feedback, work can be extracted from the system, or entropy may be absorbed by it [2]. Consider the Ising model, a mathematical model useful for understanding nonequilibrium phase transitions and magnetization reversal [3]. The objective is to obtain magnetization reversal with minimal energy dissipation, a task compounded by the nonlinear and stochastic dynamics of the Ising model. Prior approaches to learning optimal control policies assume full knowledge of the dynamics in order to obtain gradients of the objective from fully differentiable simulations [4]. However, even when the equations of motion are known, such an approach may not always be applicable, as nonequilibrium modeling is notoriously difficult, let alone fully differentiable modeling. Alternative approaches learn neural network (NN) policies in a derivative-free manner, so that the same learning procedure can be applied to experiments with sparse thermodynamic measurements as to simulations [5]. However, model-free reinforcement learning (RL) approaches are known to suffer from poor data efficiency and can only be trained through interaction between the control policy and the system [6].
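
As a point of reference only (our illustration, not code from the cited works), the sketch below sets up this control problem for a small two-dimensional Ising lattice: Glauber single-spin-flip dynamics relax the spins between control updates, the external field h and coupling J are assumed to be the control inputs, and the work of a trajectory is accumulated from the energy changes incurred each time the control parameters are updated. The lattice size, temperature, and linear field ramp are arbitrary illustrative choices.

```python
import jax
import jax.numpy as jnp

def energy(spins, h, J):
    # Nearest-neighbor 2D Ising energy with periodic boundaries:
    # E = -J * sum_<ij> s_i s_j - h * sum_i s_i (each bond counted once).
    nn = jnp.roll(spins, 1, axis=0) + jnp.roll(spins, 1, axis=1)
    return -J * jnp.sum(spins * nn) - h * jnp.sum(spins)

def glauber_sweep(key, spins, h, J, beta):
    # One sweep of random single-spin-flip Glauber updates (heat exchange with the bath).
    L = spins.shape[0]
    def step(carry, key_i):
        spins, = carry
        k1, k2 = jax.random.split(key_i)
        i, j = jax.random.randint(k1, (2,), 0, L)
        nbrs = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * spins[i, j] * (J * nbrs + h)       # energy cost of flipping spin (i, j)
        p_flip = 1.0 / (1.0 + jnp.exp(beta * dE))      # Glauber acceptance probability
        flip = jax.random.uniform(k2) < p_flip
        spins = spins.at[i, j].multiply(jnp.where(flip, -1, 1))
        return (spins,), None
    keys = jax.random.split(key, L * L)
    (spins,), _ = jax.lax.scan(step, (spins,), keys)
    return spins

def rollout_work(key, spins, protocol, beta=1.0):
    # Trajectory-level bookkeeping: work accumulates from changes in the control
    # parameters (h, J); heat comes from the spin flips in between.
    work = 0.0
    h, J = protocol[0]
    for (h_new, J_new) in protocol[1:]:
        work += energy(spins, h_new, J_new) - energy(spins, h, J)
        h, J = h_new, J_new
        key, sub = jax.random.split(key)
        spins = glauber_sweep(sub, spins, h, J, beta)
    return spins, work

# Usage: 16x16 lattice starting magnetized down, with a linear upward ramp of the field;
# prints the final mean magnetization and the accumulated work.
key = jax.random.PRNGKey(0)
spins0 = -jnp.ones((16, 16))
protocol = [(h, 1.0) for h in jnp.linspace(-1.0, 1.0, 20)]
spins_T, W = rollout_work(key, spins0, protocol)
print(float(jnp.mean(spins_T)), float(W))
```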

To address the above challenges, we propose an optimization-based controller with feedback from experimentally accessible measurements for nonequilibrium systems where the dynamics and the exact functional form of the performance objective are unknown. Optimization-based control policies such as model predictive control (MPC) generate sequences of optimal control inputs that minimize a cost function based on an underlying model [7], offering a number of benefits over black-box NN policies: improved interpretability, fewer tunable parameters, constraint enforcement, and efficient state-space navigation through planning [8]. Learning an implicit optimization-based policy requires overcoming two problems: (i) system identification, the problem of learning the unknown dynamics; and (ii) inverse optimal control, the problem of learning the control objective function [9]. This has been accomplished with RL frameworks, which often employ policy-gradient estimates obtained from data [10], or with Bayesian optimization, which constructs and optimizes a surrogate objective from data [11]. However, advances in differentiable optimization and machine learning open new doors for end-to-end learning of implicitly defined control policies [12], while embodying the notion of identification for control (I4C) [13]. By applying Dini’s implicit function theorem to compute analytical derivatives of the trajectories that solve the inner optimal control problem with respect to the underlying policy parameters, a control policy may be updated with gradient information taken directly from the task loss [14]. However, optimizing control policies in the real world can be expensive; instead, one may prefer to optimize a policy inside a simulated latent-space “dream world.” World models, trained incrementally to simulate reality, offer a route to offline policy optimization in which policies learned in the latent space are then transferred back to the real world [15]. As fully differentiable computation graphs, world models enable policy optimization directly in the simulated latent space using backpropagation to maximize the performance objective.
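
To make the implicit-differentiation step concrete, the following minimal sketch (our illustration, with a toy scalar linear system and quadratic costs standing in for the actual optimal control problem) computes the gradient of an outer task loss with respect to policy parameters theta through the inner solution u*(theta): because the gradient of the inner objective vanishes at the optimum, the implicit function theorem gives du*/dtheta as the solution of a linear system built from second derivatives, so no unrolling of the inner solver is needed.

```python
import jax
import jax.numpy as jnp

def inner_objective(u, theta, x0):
    # Toy inner optimal-control objective for scalar dynamics x_{t+1} = a*x_t + b*u_t,
    # with quadratic stage cost weighted by the learnable parameters theta = (q, r).
    a, b = 0.9, 0.5
    q, r = theta
    x, cost = x0, 0.0
    for u_t in u:
        cost += q * x**2 + r * u_t**2
        x = a * x + b * u_t
    return cost + q * x**2

def solve_inner(theta, x0, horizon=10):
    # The toy inner problem is convex quadratic in u, so one Newton step from u = 0
    # solves it exactly (a stand-in for a full MPC solver).
    u0 = jnp.zeros(horizon)
    g = jax.grad(inner_objective, argnums=0)(u0, theta, x0)
    H = jax.hessian(inner_objective, argnums=0)(u0, theta, x0)
    return u0 - jnp.linalg.solve(H, g)

def task_gradient(theta, x0, task_loss):
    # Implicit function theorem: at the optimum grad_u f(u*, theta) = 0, so
    #   du*/dtheta = -[H_uu]^{-1} H_{u,theta},
    # which lets the task-loss gradient chain through u* without unrolling the solver.
    u_star = solve_inner(theta, x0)
    H_uu = jax.hessian(inner_objective, argnums=0)(u_star, theta, x0)
    H_ut = jax.jacobian(jax.grad(inner_objective, argnums=0), argnums=1)(u_star, theta, x0)
    du_dtheta = -jnp.linalg.solve(H_uu, H_ut)
    return jax.grad(task_loss)(u_star) @ du_dtheta

# Usage: gradient of an outer task loss (here, input effort) with respect to (q, r).
task = lambda u: jnp.sum(u**2)
print(task_gradient(jnp.array([1.0, 0.1]), x0=1.0, task_loss=task))
```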

In this work, we demonstrate the use of differentiable optimization to learn optimal control policies for the Ising system through world models. Differentiable optimization has regularly appeared in the context of imitation learning, where policy learning assumes access to expert demonstrations to mimic rather than a prespecified reward to maximize [16]. Its application is less studied in problems with sparse rewards, such as thermodynamic costs measured at the level of entire trajectories in the Ising model. However, by learning and backpropagating through a world model that relates action sequences to thermodynamic costs, we can tune the control policy in a performance-oriented manner with gradient information. We devise a two-stage, iterative learning procedure comprising: (i) system identification of the policy dynamics model and the world model from interaction with the real system, and (ii) offline, gradient-based policy optimization through “imagined” latent trajectories in the fully differentiable world model, without interacting with the real system. By effectively interleaving these two stages, this learning approach yields an interpretable controller capable of constraint enforcement, whose components can be optimized, without additional interaction with the real system, using the knowledge retained within the world model. In comparison with derivative-free methods, we investigate the performance and sample complexity of the learned control policy in obtaining magnetization reversal of the Ising model with minimal energy dissipation.
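
The sketch below caricatures this interleaved procedure under deliberately strong simplifications of our own choosing: a noisy scalar linear system stands in for the Ising dynamics, a one-step linear regression stands in for the latent world model, and a scalar feedback gain stands in for the implicit optimization-based policy. It is meant only to show the structure of the loop, namely fitting the model on real transitions and then improving the policy by backpropagating a task loss through imagined rollouts.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for the real (non-differentiable, noisy) system to be controlled.
def real_step(key, x, u):
    return 0.9 * x + 0.5 * u + 0.05 * jax.random.normal(key)

def world_model(w, x, u):
    # Learned one-step model x' ~ w[0]*x + w[1]*u (a linear proxy for the latent world model).
    return w[0] * x + w[1] * u

def policy(theta, x):
    # Differentiable feedback policy u = -theta*x (a proxy for the implicit MPC policy).
    return -theta * x

def fit_world_model(data):
    # Stage (i): system identification by least squares on real transitions (x, u, x').
    X, U, Xn = data
    A = jnp.stack([X, U], axis=1)
    w, *_ = jnp.linalg.lstsq(A, Xn)
    return w

def imagined_cost(theta, w, x0=1.0, horizon=20):
    # Stage (ii): roll the policy out inside the learned model and accumulate a task loss
    # (a quadratic state/input cost standing in for trajectory-level dissipation).
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = policy(theta, x)
        cost += x**2 + 0.1 * u**2
        x = world_model(w, x, u)
    return cost

grad_cost = jax.jit(jax.grad(imagined_cost))
key, theta = jax.random.PRNGKey(0), 0.1
for it in range(5):                       # interleave identification and policy optimization
    # Collect real transitions with the current policy plus exploration noise
    # (the noise keeps the regression identifiable).
    xs, us, xns, x = [], [], [], 1.0
    for t in range(50):
        key, k1, k2 = jax.random.split(key, 3)
        u = policy(theta, x) + 0.3 * jax.random.normal(k1)
        xn = real_step(k2, x, u)
        xs.append(x); us.append(u); xns.append(xn); x = xn
    w = fit_world_model((jnp.array(xs), jnp.array(us), jnp.array(xns)))
    # Offline policy optimization: backpropagate through imagined rollouts of the world model.
    for _ in range(200):
        theta = theta - 0.01 * grad_cost(theta, w)
    print(it, float(theta), w)
```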

[1] U. Seifert, “Stochastic thermodynamics, fluctuation theorems and molecular machines,” Reports on Progress in Physics, vol. 75, no. 12, p. 126001, 2012.
[2] T. Sagawa and M. Ueda, “Nonequilibrium thermodynamics of feedback control,” Phys. Rev. E, vol. 85, p. 021104, 2012.
[3] L. Onsager, “Crystal statistics. I. A two-dimensional model with an order-disorder transition,” Phys. Rev., vol. 65, pp. 117–149, 1944.
[4] M. C. Engel, J. A. Smith, and M. P. Brenner, “Optimal control of nonequilibrium systems through automatic differentiation,” Phys. Rev. X, vol. 13, p. 041032, 2023.
[5] S. Whitelam, “Demon in the machine: Learning to extract work and absorb entropy from fluctuating nanosystems,” Phys. Rev. X, vol. 13, p. 021005, 2023.
[6] B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[7] J. Rawlings, D. Mayne, and M. Diehl, Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, 2017.
[8] S. Levine and V. Koltun, “Guided Policy Search,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 3. PMLR, 2013, pp. 1–9.
[9] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based model predictive control: Toward safe learning in control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.
[10] A. B. Kordabad, D. Reinhardt, A. S. Anand, and S. Gros, “Reinforcement Learning for MPC: Fundamentals and Current Challenges,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 5773–5780, 2023.
[11] J. A. Paulson, F. Sorourifar, and A. Mesbah, “A Tutorial on Derivative-Free Policy Learning Methods for Interpretable Controller Representations,” in 2023 American Control Conference (ACC), 2023, pp. 1295–1306.
[12] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter, “Differentiable MPC for End-to-end Planning and Control,” in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018, pp. 8299–8310.
[13] M. Gevers, “Identification for control: From the early achievements to the revival of experiment design,” European Journal of Control, vol. 11, no. 4-5, pp. 335–352, 2005.
[14] M. Xu, T. Molloy, and S. Gould, “Revisiting implicit differentiation for learning problems in optimal control,” 2023.
[15] D. Ha and J. Schmidhuber, “World models,” arXiv preprint arXiv:1803.10122, 2018.
[16] W. Jin, Z. Wang, Z. Yang, and S. Mou, “Pontryagin Differentiable Programming: An End-to-End Learning and Control Framework,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 7979–7992.