(674g) End-to-End Control Policy Learning through Implicit Differentiation with World Models
2024 AIChE Annual Meeting
Computing and Systems Technology Division
10B: AI/ML Modeling, Optimization and Control Applications I
Thursday, October 31, 2024 - 2:06pm to 2:22pm
To address the above challenges, we propose an optimization-based controller with feedback from experimentally accessible measurements for nonequilibrium systems in which the dynamics and the exact functional form of the performance objective are unknown. Optimization-based control policies such as model predictive control (MPC) generate sequences of optimal control inputs that minimize a cost function based on an underlying model [7], offering several benefits over black-box neural network (NN) policies: improved interpretability, fewer tunable parameters, constraint enforcement, and efficient state-space navigation through planning [8]. Learning an implicit optimization-based policy requires overcoming two problems: (i) system identification, the problem of learning the unknown dynamics; and (ii) inverse optimal control, the problem of learning the control objective function [9]. This has been accomplished through RL frameworks, often employing estimates of policy gradients from data [10], or through Bayesian optimization, which constructs and optimizes a surrogate objective from data [11]. However, advances in differentiable optimization and machine learning open new doors for end-to-end learning of implicitly defined control policies [12], while embodying the notion of identification for control (I4C) [13]. By computing analytical derivatives of the trajectories that solve the inner optimal control problem with respect to the underlying policy parameters, via Dini’s implicit function theorem, a control policy may be updated with gradient information taken directly from the task loss [14]. However, optimizing control policies in the real world can be expensive; instead, one may prefer to optimize a policy inside a simulated latent-space “dream world.” World models, trained incrementally to simulate reality, offer a route to offline policy optimization in which the resulting policies are transferred back to the real world [15]. As a fully differentiable computation graph, a world model enables policy optimization in the simulated latent space directly by backpropagation in order to maximize the performance objective.
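As an illustration of this implicit-differentiation step (not the formulation used in this work), the JAX sketch below solves a small quadratic inner control problem, differentiates its solution with respect to the policy parameters via the implicit function theorem, and chains the result into a task-loss gradient. The inner objective, task loss, and solver settings are illustrative placeholders.

```python
# Minimal sketch: gradient of a task loss with respect to the parameters of an
# implicitly defined control policy, via the implicit function theorem.
# The quadratic inner objective and the task loss are placeholders, not the
# Ising-model objective from this work.
import jax
import jax.numpy as jnp

def inner_objective(u, theta):
    # Placeholder control cost: quadratic in the input sequence u,
    # parameterized by theta (e.g., cost weights of an MPC policy).
    Q = jnp.diag(jax.nn.softplus(theta))          # keeps the Hessian positive definite
    return 0.5 * u @ Q @ u + jnp.sum(jnp.sin(theta) * u)

def solve_inner(theta, n_steps=500, lr=1e-1):
    # Solve u*(theta) = argmin_u inner_objective(u, theta) by gradient descent.
    u = jnp.zeros_like(theta)
    grad_u = jax.grad(inner_objective, argnums=0)
    for _ in range(n_steps):
        u = u - lr * grad_u(u, theta)
    return u

def task_loss(u):
    # Placeholder task loss evaluated on the optimal inputs.
    return jnp.sum((u - 1.0) ** 2)

def policy_gradient(theta):
    # At the inner optimum, dJ/du(u*, theta) = 0, so by the implicit function
    # theorem du*/dtheta = -(d2J/du2)^{-1} d2J/(du dtheta); the task-loss
    # gradient with respect to theta then follows from the chain rule.
    u_star = solve_inner(theta)
    H_uu = jax.hessian(inner_objective, argnums=0)(u_star, theta)
    H_ut = jax.jacobian(jax.grad(inner_objective, argnums=0), argnums=1)(u_star, theta)
    du_dtheta = -jnp.linalg.solve(H_uu, H_ut)
    dL_du = jax.grad(task_loss)(u_star)
    return du_dtheta.T @ dL_du

theta = jnp.array([0.3, -0.2, 0.5])
print(policy_gradient(theta))  # gradient of the task loss w.r.t. policy parameters
```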
In this work, we demonstrate the use of differentiable optimization for learning optimal control policies for the Ising system through world models. Differentiable optimization has most often been studied in the context of imitation learning, where policy learning assumes access to expert demonstrations to mimic rather than learning to maximize a prespecified reward [16]. Its application is less studied in problems with sparse rewards, such as thermodynamic quantities measured at the level of entire trajectories in the Ising model. However, by learning and backpropagating through a world model that relates action sequences to thermodynamic costs, we can tune the control policy in a performance-oriented manner with gradient information. We devise a two-stage, iterative learning procedure consisting of: (i) system identification of the policy dynamics model and the world model based on interaction with the real system, and (ii) offline, gradient-based policy optimization through “imagined” latent trajectories in the fully differentiable world model, without interacting with the real system. By effectively interleaving these two stages, this approach yields an interpretable controller capable of constraint enforcement, whose components can be optimized without additional interaction with the real system using the knowledge retained within the world model. We investigate the performance and sample complexity of the learned control policy, in comparison with derivative-free methods, in achieving magnetization reversal of the Ising model with minimal energy dissipation.
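A minimal sketch of this two-stage loop, under simplifying assumptions, is given below: a small MLP world model is fit to transitions collected from the real system, and the policy parameters are then updated offline by backpropagating a trajectory-level cost through “imagined” rollouts of that model. The network sizes, environment interface, and cost terms are placeholders rather than the Ising-specific components used in this work.

```python
# Minimal sketch (assumed structure, not this work's implementation) of:
# (i) world-model fitting from real transitions, and
# (ii) offline policy optimization by backpropagation through imagined rollouts.
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, (m, n) in zip(keys, zip(sizes[:-1], sizes[1:]))]

def mlp(params, x):
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def model_loss(model, states, actions, next_states):
    # Stage (i): one-step prediction loss on real transitions.
    pred = mlp(model, jnp.concatenate([states, actions], axis=-1))
    return jnp.mean((pred - next_states) ** 2)

def imagined_cost(policy, model, s0, horizon=20):
    # Stage (ii): roll the policy out inside the world model and accumulate a
    # placeholder trajectory cost; the whole rollout stays differentiable.
    def step(s, _):
        a = jnp.tanh(mlp(policy, s))
        s_next = mlp(model, jnp.concatenate([s, a]))
        cost = jnp.sum(a ** 2) + jnp.sum((s_next - 1.0) ** 2)
        return s_next, cost
    _, costs = jax.lax.scan(step, s0, None, length=horizon)
    return jnp.sum(costs)

key = jax.random.PRNGKey(0)
state_dim, action_dim = 4, 2
model = init_mlp(key, [state_dim + action_dim, 64, state_dim])
policy = init_mlp(jax.random.split(key)[1], [state_dim, 32, action_dim])

# (i) fit the world model on batches of (assumed) real transitions using model_loss ...
# (ii) then update the policy purely in imagination, without touching the real system:
s0 = jnp.zeros(state_dim)
grads = jax.grad(imagined_cost)(policy, model, s0)
policy = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, policy, grads)
```

Interleaving these two updates, with fresh real-system data collected between rounds, mirrors the iterative procedure described above.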
[1] U. Seifert, “Stochastic thermodynamics, fluctuation theorems and molecular machines,” Reports on Progress in Physics, vol. 75, no. 12, p. 126001, 2012.
[2] T. Sagawa and M. Ueda, “Nonequilibrium thermodynamics of feedback control,” Phys. Rev. E, vol. 85, p. 021104, 2012.
[3] L. Onsager, “Crystal statistics. I. A two-dimensional model with an order-disorder transition,” Phys. Rev., vol. 65, pp. 117–149, 1944.
[4] M. C. Engel, J. A. Smith, and M. P. Brenner, “Optimal control of nonequilibrium systems through automatic differentiation,” Phys. Rev. X, vol. 13, p. 041032, 2023.
[5] S. Whitelam, “Demon in the machine: Learning to extract work and absorb entropy from fluctuating nanosystems,” Phys. Rev. X, vol. 13, p. 021005, 2023.
[6] B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[7] J. Rawlings, D. Mayne, and M. Diehl, Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, 2017.
[8] S. Levine and V. Koltun, “Guided Policy Search,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 3. PMLR, 2013, pp. 1–9.
[9] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based model predictive control: Toward safe learning in control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.
[10] A. B. Kordabad, D. Reinhardt, A. S. Anand, and S. Gros, “Reinforcement Learning for MPC: Fundamentals and Current Challenges,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 5773–5780, 2023.
[11] J. A. Paulson, F. Sorourifar, and A. Mesbah, “A Tutorial on Derivative-Free Policy Learning Methods for Interpretable Controller Representations,” in 2023 American Control Conference (ACC), 2023, pp. 1295–1306.
[12] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter, “Differentiable MPC for End-to-end Planning and Control,” in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018, pp. 8299–8310.
[13] M. Gevers, “Identification for control: From the early achievements to the revival of experiment design,” European Journal of Control, vol. 11, no. 4-5, pp. 335–352, 2005.
[14] M. Xu, T. Molloy, and S. Gould, “Revisiting implicit differentiation for learning problems in optimal control,” 2023.
[15] D. Ha and J. Schmidhuber, “World models,” arXiv preprint arXiv:1803.10122, 2018.
[16] W. Jin, Z. Wang, Z. Yang, and S. Mou, “Pontryagin Differentiable Programming: An End-to-End Learning and Control Framework,” in Advances in Neural Information Processing Systems, vol. 33, pp. 7979–7992, 2020.